What is it that we are trying to solve using data science?
Airbnb's search function is a well-known example of data science and analytics at work behind the scenes.
Ref:
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
# Libraries to read and manipulate data
import numpy as np
import pandas as pd
# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To compute distances
from scipy.spatial.distance import cdist, pdist
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# To visualize an elbow curve and Silhouette scores
#from yellowbrick.cluster import SilhouetteVisualizer
# To encode the variable
from sklearn.preprocessing import LabelEncoder
# Importing PCA
from sklearn.decomposition import PCA
# Importing TSNE
from sklearn.manifold import TSNE
# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# To import K-Medoids
from sklearn_extra.cluster import KMedoids
# To import DBSCAN
from sklearn.cluster import DBSCAN
# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture
# To work with datetime values
from datetime import datetime
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Reading the given dataset
data = pd.read_csv("marketing_campaign.csv")
# First five rows of the data, to get a feel for it
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
# Last five rows of the data
data.tail()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 27 columns
# Shape of the given data
data.shape
(2240, 27)
# Info of the given dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 472.6+ KB
Let's check for missing values and duplicates.
# Checking for missing values
data.isnull().sum()
ID 0 Year_Birth 0 Education 0 Marital_Status 0 Income 24 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Response 0 dtype: int64
mv_count = data.Income.isnull().sum()         # count of missing values
non_mv_count = data.Income.notnull().sum()    # count of non-missing records
mv_percent = (mv_count / non_mv_count) * 100  # missing values as a percentage of the non-missing records
print('Number of Missing values in Income column: ', mv_count)
print('Number of non missing values in Income column: ', non_mv_count)
print('Percentage of missing values in Income column: ', round(mv_percent, 2))
Number of Missing values in Income column: 24 Number of non missing values in Income column: 2216 Percentage of missing values in Income column: 1.08
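Note that the cell above reports the missing share relative to the *non-missing* count (24 / 2216 ≈ 1.08%); missing percentages are more commonly reported against the total row count (24 / 2240 ≈ 1.07%). A minimal sketch of the latter, on a made-up toy Series standing in for `data['Income']`:

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical values); the real call would use data['Income']
income = pd.Series([58138.0, np.nan, 71613.0, 26646.0, np.nan])

mv_count = income.isnull().sum()
total_rows = len(income)                    # missing rows included
mv_percent = (mv_count / total_rows) * 100  # share of ALL rows that are missing
print(round(mv_percent, 2))
```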
#Checking for duplicates
data[data.duplicated()]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 27 columns
# Checking for unique values
data.nunique()
ID 2240 Year_Birth 59 Education 5 Marital_Status 8 Income 1974 Kidhome 3 Teenhome 3 Dt_Customer 663 Recency 100 MntWines 776 MntFruits 158 MntMeatProducts 558 MntFishProducts 182 MntSweetProducts 177 MntGoldProds 213 NumDealsPurchases 15 NumWebPurchases 15 NumCatalogPurchases 14 NumStorePurchases 14 NumWebVisitsMonth 16 AcceptedCmp3 2 AcceptedCmp4 2 AcceptedCmp5 2 AcceptedCmp1 2 AcceptedCmp2 2 Complain 2 Response 2 dtype: int64
# Dropping the "ID" column
data = data.drop(['ID'], axis = 1) # axis=1 for columns and axis=0 for rows
data.info() # making sure the 'ID' column got dropped.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2240 non-null int64 1 Education 2240 non-null object 2 Marital_Status 2240 non-null object 3 Income 2216 non-null float64 4 Kidhome 2240 non-null int64 5 Teenhome 2240 non-null int64 6 Dt_Customer 2240 non-null object 7 Recency 2240 non-null int64 8 MntWines 2240 non-null int64 9 MntFruits 2240 non-null int64 10 MntMeatProducts 2240 non-null int64 11 MntFishProducts 2240 non-null int64 12 MntSweetProducts 2240 non-null int64 13 MntGoldProds 2240 non-null int64 14 NumDealsPurchases 2240 non-null int64 15 NumWebPurchases 2240 non-null int64 16 NumCatalogPurchases 2240 non-null int64 17 NumStorePurchases 2240 non-null int64 18 NumWebVisitsMonth 2240 non-null int64 19 AcceptedCmp3 2240 non-null int64 20 AcceptedCmp4 2240 non-null int64 21 AcceptedCmp5 2240 non-null int64 22 AcceptedCmp1 2240 non-null int64 23 AcceptedCmp2 2240 non-null int64 24 Complain 2240 non-null int64 25 Response 2240 non-null int64 dtypes: float64(1), int64(22), object(3) memory usage: 455.1+ KB
# Making a copy of the dataframe
data_bkup = data.copy()
# data = data_bkup.copy()
Questions:
# Summary statistics for numerical variables
data.describe().round(2).T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | 1968.81 | 11.98 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.25 | 25173.08 | 1730.0 | 35303.00 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2240.0 | 0.44 | 0.54 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.51 | 0.54 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.11 | 28.96 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.94 | 336.60 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.30 | 39.77 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.95 | 225.72 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.53 | 54.63 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.06 | 41.28 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.02 | 52.17 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.33 | 1.93 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.08 | 2.78 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.66 | 2.92 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.79 | 3.25 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.32 | 2.43 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.07 | 0.26 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.07 | 0.26 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.07 | 0.26 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.06 | 0.25 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.01 | 0.11 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.01 | 0.10 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.15 | 0.36 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
# Summary statistics for non-numerical variables
data.describe(exclude = 'number').T
| count | unique | top | freq | |
|---|---|---|---|---|
| Education | 2240 | 5 | Graduation | 1127 |
| Marital_Status | 2240 | 8 | Married | 864 |
| Dt_Customer | 2240 | 663 | 31-08-2012 | 12 |
# ed_values = data.Education.unique() # unique values in the Education column
# print(ed_values)
# Let's take a closer look at the categorical variables
cat_var = ['Education', 'Marital_Status']
# Counting each unique value in each column
for column in cat_var:
print(data[column].value_counts())
print('-' * 50)
Graduation 1127 PhD 486 Master 370 2n Cycle 203 Basic 54 Name: Education, dtype: int64 -------------------------------------------------- Married 864 Together 580 Single 480 Divorced 232 Widow 77 Alone 3 Absurd 2 YOLO 2 Name: Marital_Status, dtype: int64 --------------------------------------------------
# Merging categories in the Education column
# (restricting replace to the target column avoids accidental matches elsewhere)
data['Education'] = data['Education'].replace(['2n Cycle'], 'Master')
# Merging categories in the Marital_Status column
data['Marital_Status'] = data['Marital_Status'].replace(['Divorced', 'Widow', 'Alone', 'Absurd', 'YOLO'], 'Single')
# Let's make sure the replace command did its job.
cat_var = ['Education', 'Marital_Status']
# Counting each unique value in each column
for column in cat_var:
print(data[column].value_counts())
print('-' * 50)
Graduation 1127 Master 573 PhD 486 Basic 54 Name: Education, dtype: int64 -------------------------------------------------- Married 864 Single 796 Together 580 Name: Marital_Status, dtype: int64 --------------------------------------------------
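The same merges can be written with explicit per-column mapping dicts, which keeps the intent visible at a glance. A minimal sketch on a toy frame (the category labels mirror the dataset; the frame itself is made up):

```python
import pandas as pd

# Toy frame with the same category labels as the dataset
df = pd.DataFrame({
    'Education': ['Graduation', '2n Cycle', 'PhD', 'Basic'],
    'Marital_Status': ['Married', 'YOLO', 'Widow', 'Together'],
})

# Per-column mapping dicts make each merge explicit
edu_map = {'2n Cycle': 'Master'}
mar_map = {m: 'Single' for m in ['Divorced', 'Widow', 'Alone', 'Absurd', 'YOLO']}

df['Education'] = df['Education'].replace(edu_map)
df['Marital_Status'] = df['Marital_Status'].replace(mar_map)
print(df['Education'].tolist())       # ['Graduation', 'Master', 'PhD', 'Basic']
print(df['Marital_Status'].tolist())  # ['Married', 'Single', 'Single', 'Together']
```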
# Let's explore some other variables: Kidhome, Teenhome, Complain and Response
cat_var = ['Kidhome', 'Teenhome', 'Complain', 'Response']
# Counting each unique value in each column
for column in cat_var:
print(data[column].value_counts())
print('-' * 50)
0 1293 1 899 2 48 Name: Kidhome, dtype: int64 -------------------------------------------------- 0 1158 1 1030 2 52 Name: Teenhome, dtype: int64 -------------------------------------------------- 0 2219 1 21 Name: Complain, dtype: int64 -------------------------------------------------- 0 1906 1 334 Name: Response, dtype: int64 --------------------------------------------------
Univariate analysis explores each variable in a data set separately. It looks at the range of values as well as their central tendency, and can be applied to both numerical and categorical variables.
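As a minimal illustration of the idea (a toy Series, not the campaign data):

```python
import pandas as pd

# Univariate look at a single numeric variable: range + central tendency
s = pd.Series([0, 10, 20, 20, 100])
print('range:', s.min(), '-', s.max())            # spread of values
print('mean:', s.mean(), 'median:', s.median())   # central tendency
print('skew:', round(s.skew(), 2))                # mean pulled above median => right skew
```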
Leading Questions:
# Plotting HIST and BOX plots for each variable
#hist_cols = data.select_dtypes(include = np.number)
hist_cols = ['Income', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth']
#for col in data.columns:
for col in hist_cols:
print('Histogram and Boxplot for',col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
# For histogram
data[col].hist()
plt.ylabel('count') # Add "count" to Y axis
plt.axvline(data[col].mean(), color = 'r', linestyle = '--') # Add mean to the histogram
mean = data[col].mean()
plt.text(mean, plt.ylim()[1]*0.9, f'mean={mean:.2f}', color = 'r', fontsize=10)
plt.axvline(data[col].median(), color = 'black', linestyle = '-') # Add median to the histogram
median = data[col].median()
plt.text(median, plt.ylim()[1]*0.8, f'median={median:.2f}', color = 'black', fontsize=10)
# For Box plot
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
#adding labels
plt.text(y=0.45, x=data[col].min(), s='min')
plt.text(y=0.45, x=data[col].quantile(0.25), s='Q1')
plt.text(y=0.45, x=data[col].median(), s='Q2')
plt.text(y=0.49, x=data[col].median(), s='median')
plt.text(y=0.45, x=data[col].quantile(0.75), s='Q3')
plt.text(y=0.45, x=data[col].max(), s='max')
Q1 = data[col].quantile(0.25)
Q3 = data[col].quantile(0.75)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR
lower_whisker = Q1 - 1.5 * IQR
p99_5 = data[col].quantile(0.995)
print(f"Upper Whisker: {upper_whisker} & Lower Whisker: {lower_whisker}")
#upper_whisker_percentile = data[col][data[col] <= upper_whisker].quantile(0.99)
#print("Upper Whisker percentile value is: ", upper_whisker_percentile)
print("99.5 percentile value: ", round(p99_5, 2))
num_outliers = data[data[col] > upper_whisker][col].count()
row_count = data[col].notnull().sum()
#print("Row Count", row_count)
print("Number of Outliers > Upper Whisker value are: ", num_outliers)
print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
num_outliers = data[data[col] > p99_5][col].count()
#print("Row Count", row_count)
print("Number of Outliers > 99.5 percentile value are: ", num_outliers)
print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
#plt.text(y=0.45, x=lower_whisker, s='lw')
plt.text(y=0.45, x=upper_whisker, s='uw')
plt.show()
Histogram and Boxplot for Income Skew : 6.76 Upper Whisker: 118350.5 & Lower Whisker: -14525.5 99.5 percentile value: 102145.75 Number of Outliers > Upper Whisker value are: 8 Those Outliers percentage is : 0.0036 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntWines Skew : 1.18 Upper Whisker: 1225.0 & Lower Whisker: -697.0 99.5 percentile value: 1373.15 Number of Outliers > Upper Whisker value are: 35 Those Outliers percentage is : 0.0156 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFruits Skew : 2.1 Upper Whisker: 81.0 & Lower Whisker: -47.0 99.5 percentile value: 184.8 Number of Outliers > Upper Whisker value are: 227 Those Outliers percentage is : 0.1013 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntMeatProducts Skew : 2.08 Upper Whisker: 556.0 & Lower Whisker: -308.0 99.5 percentile value: 950.02 Number of Outliers > Upper Whisker value are: 175 Those Outliers percentage is : 0.0781 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFishProducts Skew : 1.92 Upper Whisker: 120.5 & Lower Whisker: -67.5 99.5 percentile value: 241.61 Number of Outliers > Upper Whisker value are: 223 Those Outliers percentage is : 0.0996 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntSweetProducts Skew : 2.14 Upper Whisker: 81.0 & Lower Whisker: -47.0 99.5 percentile value: 191.8 Number of Outliers > Upper Whisker value are: 248 Those Outliers percentage is : 0.1107 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntGoldProds Skew : 1.89 Upper Whisker: 126.5 & Lower Whisker: -61.5 99.5 percentile value: 241.0 Number of Outliers > Upper Whisker value are: 207 Those Outliers percentage is : 0.0924 Number of Outliers > 99.5 percentile value are: 11 Those Outliers percentage is : 0.0049
Histogram and Boxplot for NumDealsPurchases Skew : 2.42 Upper Whisker: 6.0 & Lower Whisker: -2.0 99.5 percentile value: 12.0 Number of Outliers > Upper Whisker value are: 86 Those Outliers percentage is : 0.0384 Number of Outliers > 99.5 percentile value are: 10 Those Outliers percentage is : 0.0045
Histogram and Boxplot for NumWebPurchases Skew : 1.38 Upper Whisker: 12.0 & Lower Whisker: -4.0 99.5 percentile value: 11.0 Number of Outliers > Upper Whisker value are: 4 Those Outliers percentage is : 0.0018 Number of Outliers > 99.5 percentile value are: 4 Those Outliers percentage is : 0.0018
Histogram and Boxplot for NumCatalogPurchases Skew : 1.88 Upper Whisker: 10.0 & Lower Whisker: -6.0 99.5 percentile value: 11.0 Number of Outliers > Upper Whisker value are: 23 Those Outliers percentage is : 0.0103 Number of Outliers > 99.5 percentile value are: 4 Those Outliers percentage is : 0.0018
Histogram and Boxplot for NumWebVisitsMonth Skew : 0.21 Upper Whisker: 13.0 & Lower Whisker: -3.0 99.5 percentile value: 9.8 Number of Outliers > Upper Whisker value are: 8 Those Outliers percentage is : 0.0036 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
The Income distribution is right-skewed. Only 12 observations (about 0.5%) exceed its 99.5th percentile, and just 8 exceed the box-plot upper whisker, so we will drop those extreme rows. From the box plot, the upper whisker falls at 118,350 with mean = 52,247 and std (sigma) = 25,173, giving z ≈ 2.63: the whisker sits at roughly the 99.6th percentile, more than two standard deviations above the mean.
MntWines is also right-skewed. Like Income, it has 12 outliers above its 99.5th percentile.
For NumWebPurchases the mean and median are very close, so the distribution is nearly symmetric; it has only four outliers, with no significant impact.
NumWebVisitsMonth is roughly symmetric (skew = 0.21, with the mean slightly below the median). Like most of the other variables, only 12 of its values exceed the 99.5th percentile.
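The z-score claim for Income can be checked directly from the describe() output (a back-of-envelope sketch using the summary statistics quoted above, not re-run on the data):

```python
# Income summary statistics from data.describe() earlier
mean, std = 52247.25, 25173.08
upper_whisker = 118350.5  # Q3 + 1.5 * IQR from the box plot

z = (upper_whisker - mean) / std
print(round(z, 2))  # ~2.63: the whisker is more than 2 sigma above the mean
```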
#data = data_bkup.copy()
Q1 = data["Income"].quantile(0.25)
Q3 = data["Income"].quantile(0.75)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR
lower_whisker = Q1 - 1.5 * IQR
p99_5 = data["Income"].quantile(0.995)
#outliers_index = data[data["MntWines"] > p99_5].index
outliers_index = data[data["Income"] > upper_whisker].index
print("column outliers index: ", outliers_index)
print(np.size(outliers_index))
column outliers index: Int64Index([164, 617, 655, 687, 1300, 1653, 2132, 2233], dtype='int64') 8
# deleting above rows
data.drop(outliers_index, inplace = True)
data.info() #checking 8 rows got deleted
<class 'pandas.core.frame.DataFrame'> Int64Index: 2232 entries, 0 to 2239 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2232 non-null int64 1 Education 2232 non-null object 2 Marital_Status 2232 non-null object 3 Income 2208 non-null float64 4 Kidhome 2232 non-null int64 5 Teenhome 2232 non-null int64 6 Dt_Customer 2232 non-null object 7 Recency 2232 non-null int64 8 MntWines 2232 non-null int64 9 MntFruits 2232 non-null int64 10 MntMeatProducts 2232 non-null int64 11 MntFishProducts 2232 non-null int64 12 MntSweetProducts 2232 non-null int64 13 MntGoldProds 2232 non-null int64 14 NumDealsPurchases 2232 non-null int64 15 NumWebPurchases 2232 non-null int64 16 NumCatalogPurchases 2232 non-null int64 17 NumStorePurchases 2232 non-null int64 18 NumWebVisitsMonth 2232 non-null int64 19 AcceptedCmp3 2232 non-null int64 20 AcceptedCmp4 2232 non-null int64 21 AcceptedCmp5 2232 non-null int64 22 AcceptedCmp1 2232 non-null int64 23 AcceptedCmp2 2232 non-null int64 24 Complain 2232 non-null int64 25 Response 2232 non-null int64 dtypes: float64(1), int64(22), object(3) memory usage: 470.8+ KB
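Since the whisker computation is repeated in several cells, it could be factored into a small helper. A sketch (the function name `iqr_bounds` is my own, not from the notebook):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy usage; on the real data this would be iqr_bounds(data['Income'])
s = pd.Series([1, 2, 3, 4, 100])
low, high = iqr_bounds(s)
print(s[(s < low) | (s > high)].tolist())  # [100]
```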
data1 = data.copy() # taking backup
# data = data1.copy()
# Just checking the plots one more time
hist_cols = ['Income', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth']
#for col in data.columns:
for col in hist_cols:
print('Histogram and Boxplot for',col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
# For histogram
data[col].hist()
plt.ylabel('count') # Add "count" to Y axis
plt.axvline(data[col].mean(), color = 'r', linestyle = '--') # Add mean to the histogram
mean = data[col].mean()
plt.text(mean, plt.ylim()[1]*0.9, f'mean={mean:.2f}', color = 'r', fontsize=10)
plt.axvline(data[col].median(), color = 'black', linestyle = '-') # Add median to the histogram
median = data[col].median()
plt.text(median, plt.ylim()[1]*0.8, f'median={median:.2f}', color = 'black', fontsize=10)
# For Box plot
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
#adding labels
plt.text(y=0.45, x=data[col].min(), s='min')
plt.text(y=0.45, x=data[col].quantile(0.25), s='Q1')
plt.text(y=0.45, x=data[col].median(), s='Q2')
plt.text(y=0.49, x=data[col].median(), s='median')
plt.text(y=0.45, x=data[col].quantile(0.75), s='Q3')
plt.text(y=0.45, x=data[col].max(), s='max')
Q1 = data[col].quantile(0.25)
Q3 = data[col].quantile(0.75)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR
lower_whisker = Q1 - 1.5 * IQR
p99_5 = data[col].quantile(0.995)
print(f"Upper Whisker: {upper_whisker} & Lower Whisker: {lower_whisker}")
#upper_whisker_percentile = data[col][data[col] <= upper_whisker].quantile(0.99)
#print("Upper Whisker percentile value is: ", upper_whisker_percentile)
print("99.5 percentile value: ", round(p99_5, 2))
num_outliers = data[data[col] > upper_whisker][col].count()
row_count = data[col].notnull().sum()
#print("Row Count", row_count)
print("Number of Outliers > Upper Whisker value are: ", num_outliers)
print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
num_outliers = data[data[col] > p99_5][col].count()
#print("Row Count", row_count)
print("Number of Outliers > 99.5 percentile value are: ", num_outliers)
print("Those Outliers percentage is :", round((num_outliers/row_count), 4))
#plt.text(y=0.45, x=lower_whisker, s='lw')
plt.text(y=0.45, x=upper_whisker, s='uw')
plt.show()
Histogram and Boxplot for Income Skew : 0.01 Upper Whisker: 117930.375 & Lower Whisker: -14444.625 99.5 percentile value: 95158.57 Number of Outliers > Upper Whisker value are: 0 Those Outliers percentage is : 0.0 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntWines Skew : 1.17 Upper Whisker: 1226.5 & Lower Whisker: -697.5 99.5 percentile value: 1374.35 Number of Outliers > Upper Whisker value are: 35 Those Outliers percentage is : 0.0157 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFruits Skew : 2.1 Upper Whisker: 79.5 & Lower Whisker: -44.5 99.5 percentile value: 184.84 Number of Outliers > Upper Whisker value are: 248 Those Outliers percentage is : 0.1111 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntMeatProducts Skew : 1.89 Upper Whisker: 554.125 & Lower Whisker: -306.875 99.5 percentile value: 935.84 Number of Outliers > Upper Whisker value are: 173 Those Outliers percentage is : 0.0775 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntFishProducts Skew : 1.91 Upper Whisker: 120.5 & Lower Whisker: -67.5 99.5 percentile value: 241.69 Number of Outliers > Upper Whisker value are: 223 Those Outliers percentage is : 0.0999 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntSweetProducts Skew : 2.13 Upper Whisker: 83.5 & Lower Whisker: -48.5 99.5 percentile value: 191.84 Number of Outliers > Upper Whisker value are: 240 Those Outliers percentage is : 0.1075 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for MntGoldProds Skew : 1.88 Upper Whisker: 127.125 & Lower Whisker: -61.875 99.5 percentile value: 241.0 Number of Outliers > Upper Whisker value are: 206 Those Outliers percentage is : 0.0923 Number of Outliers > 99.5 percentile value are: 11 Those Outliers percentage is : 0.0049
Histogram and Boxplot for NumDealsPurchases Skew : 2.32 Upper Whisker: 6.0 & Lower Whisker: -2.0 99.5 percentile value: 11.84 Number of Outliers > Upper Whisker value are: 84 Those Outliers percentage is : 0.0376 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
Histogram and Boxplot for NumWebPurchases Skew : 1.39 Upper Whisker: 12.0 & Lower Whisker: -4.0 99.5 percentile value: 11.0 Number of Outliers > Upper Whisker value are: 4 Those Outliers percentage is : 0.0018 Number of Outliers > 99.5 percentile value are: 4 Those Outliers percentage is : 0.0018
Histogram and Boxplot for NumCatalogPurchases Skew : 1.37 Upper Whisker: 10.0 & Lower Whisker: -6.0 99.5 percentile value: 11.0 Number of Outliers > Upper Whisker value are: 20 Those Outliers percentage is : 0.009 Number of Outliers > 99.5 percentile value are: 1 Those Outliers percentage is : 0.0004
Histogram and Boxplot for NumWebVisitsMonth Skew : 0.22 Upper Whisker: 13.0 & Lower Whisker: -3.0 99.5 percentile value: 9.84 Number of Outliers > Upper Whisker value are: 8 Those Outliers percentage is : 0.0036 Number of Outliers > 99.5 percentile value are: 12 Those Outliers percentage is : 0.0054
# Creating function for barplots
def cat_bar_perc(data, z):
total = len(data[z]) # column length
plt.figure(figsize = (15, 5))
ax = sns.countplot(x = data[z], palette = 'Paired') # pass x as a keyword; positional use is deprecated in newer seaborn
for p in ax.patches:
percentage = '{:.1f}%'.format(100 * p.get_height() / total) # Percentage of each class
x = p.get_x() + p.get_width() / 2 - 0.05 # label x position (bar center)
y = p.get_y() + p.get_height() # label y position (top of the bar)
ax.annotate(percentage, (x, y), size = 12)
plt.show()
cat_cols = ['Education', 'Kidhome', 'Teenhome', 'Complain', 'Response', 'Marital_Status'] # avoid reusing the function parameter name z
for col in cat_cols:
cat_bar_perc(data, col)
plt.figure(figsize = (15, 15))
sns.heatmap(data.select_dtypes(include = np.number).corr(), annot = True) # restrict to numeric columns explicitly
plt.show()
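With 20+ numeric columns the annotated heatmap gets dense; ranking pairs by absolute correlation can be easier to read. A sketch on a toy frame (on the real data, `df` would be replaced by the numeric columns of `data`):

```python
import numpy as np
import pandas as pd

# Toy frame; b is perfectly correlated with a, c is strongly anti-correlated
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 2, 2]})
corr = df.corr()

# Keep the upper triangle only, so each pair appears once; rank by |r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs.head())  # ('a', 'b') comes first with |r| = 1.0
```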
Numerical variables: NumStorePurchases, MntWines, NumCatalogPurchases, MntMeatProducts, etc.
Categorical variables: Education, Marital_Status, Kidhome, Teenhome, etc.
# Categorical variables vs. **NumStorePurchases**
cat_col = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']
for col in cat_col:
plt.figure(figsize = (10, 10))
#sns.boxplot(data['Education'], data['NumStorePurchases']) # x-axis: categorical variable, y-axis: numerical variable
#sns.boxplot(x = data[col], y = data['NumStorePurchases'])
sns.barplot(x = col, y = 'NumStorePurchases', data = data) # pass column names with data=, rather than Series and data together
plt.ylabel('NumStorePurchases')
plt.show()
# print(data['Income'].dtype)
# data['Income'] = data['Income'].astype(float)
# print(data['Income'].dtype)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2232 entries, 0 to 2239 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2232 non-null int64 1 Education 2232 non-null object 2 Marital_Status 2232 non-null object 3 Income 2208 non-null float64 4 Kidhome 2232 non-null int64 5 Teenhome 2232 non-null int64 6 Dt_Customer 2232 non-null object 7 Recency 2232 non-null int64 8 MntWines 2232 non-null int64 9 MntFruits 2232 non-null int64 10 MntMeatProducts 2232 non-null int64 11 MntFishProducts 2232 non-null int64 12 MntSweetProducts 2232 non-null int64 13 MntGoldProds 2232 non-null int64 14 NumDealsPurchases 2232 non-null int64 15 NumWebPurchases 2232 non-null int64 16 NumCatalogPurchases 2232 non-null int64 17 NumStorePurchases 2232 non-null int64 18 NumWebVisitsMonth 2232 non-null int64 19 AcceptedCmp3 2232 non-null int64 20 AcceptedCmp4 2232 non-null int64 21 AcceptedCmp5 2232 non-null int64 22 AcceptedCmp1 2232 non-null int64 23 AcceptedCmp2 2232 non-null int64 24 Complain 2232 non-null int64 25 Response 2232 non-null int64 dtypes: float64(1), int64(22), object(3) memory usage: 470.8+ KB
# Categorical variables Vs. **Income**
cat_col = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']
for col in cat_col:
plt.figure(figsize = (10, 10))
#sns.boxplot(data['Education'], data['NumStorePurchases']) # x-axis: categorical variable, y-axis: numerical variable
#sns.boxplot(x = data[col], y = data['Income'])
sns.barplot(x = col, y = 'Income', data = data) # pass column names with data=, rather than Series and data together
plt.ylabel('Income')
#plt.xlabel(data[col])
plt.show()
In this section, we will first prepare our dataset for analysis.
# show all 24 rows with NaN in Income column
data[data['Income'].isnull()]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1983 | Graduation | Married | NaN | 1 | 0 | 15-11-2013 | 11 | 5 | 5 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 1986 | Graduation | Single | NaN | 1 | 0 | 20-02-2013 | 19 | 5 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43 | 1959 | PhD | Single | NaN | 0 | 0 | 05-11-2013 | 80 | 81 | 11 | ... | 3 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | 1951 | Graduation | Single | NaN | 2 | 1 | 01-01-2014 | 96 | 48 | 5 | ... | 1 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 58 | 1982 | Graduation | Single | NaN | 1 | 0 | 17-06-2013 | 57 | 11 | 3 | ... | 0 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 71 | 1973 | Master | Married | NaN | 1 | 0 | 14-09-2012 | 25 | 25 | 3 | ... | 0 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 90 | 1957 | PhD | Married | NaN | 2 | 1 | 19-11-2012 | 4 | 230 | 42 | ... | 2 | 8 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 91 | 1957 | Graduation | Single | NaN | 1 | 1 | 27-05-2014 | 45 | 7 | 0 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 92 | 1973 | Master | Together | NaN | 0 | 0 | 23-11-2013 | 87 | 445 | 37 | ... | 4 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 128 | 1961 | PhD | Married | NaN | 0 | 1 | 11-07-2013 | 23 | 352 | 0 | ... | 1 | 7 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 133 | 1963 | Graduation | Married | NaN | 0 | 1 | 11-08-2013 | 96 | 231 | 65 | ... | 5 | 7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 312 | 1989 | Graduation | Married | NaN | 0 | 0 | 03-06-2013 | 69 | 861 | 138 | ... | 5 | 12 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 319 | 1970 | Graduation | Single | NaN | 1 | 2 | 23-08-2013 | 67 | 738 | 20 | ... | 3 | 10 | 7 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1379 | 1970 | Master | Together | NaN | 0 | 1 | 01-04-2013 | 39 | 187 | 5 | ... | 2 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1382 | 1958 | Graduation | Together | NaN | 1 | 1 | 03-09-2012 | 87 | 19 | 4 | ... | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1383 | 1964 | Master | Single | NaN | 1 | 1 | 12-01-2014 | 49 | 5 | 1 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1386 | 1972 | PhD | Together | NaN | 1 | 0 | 02-03-2014 | 17 | 25 | 1 | ... | 0 | 3 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2059 | 1969 | Master | Together | NaN | 1 | 1 | 18-05-2013 | 52 | 375 | 42 | ... | 10 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2061 | 1981 | PhD | Single | NaN | 1 | 0 | 31-05-2013 | 82 | 23 | 0 | ... | 0 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2078 | 1971 | Graduation | Married | NaN | 1 | 1 | 03-03-2013 | 82 | 71 | 1 | ... | 1 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2079 | 1954 | Master | Together | NaN | 0 | 1 | 23-06-2013 | 83 | 161 | 0 | ... | 1 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2081 | 1955 | Graduation | Single | NaN | 0 | 1 | 18-10-2013 | 95 | 264 | 0 | ... | 1 | 5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2084 | 1943 | Master | Single | NaN | 0 | 0 | 30-10-2013 | 75 | 532 | 126 | ... | 5 | 11 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2228 | 1978 | Master | Together | NaN | 0 | 0 | 12-08-2012 | 53 | 32 | 2 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
24 rows × 26 columns
#data = data1.copy()
income_mean = data['Income'].mean() #calculate the mean of the column Income
print('Income Column mean: ',income_mean)
print(type(income_mean))
Income Column mean: 51633.63813405797 <class 'float'>
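Mean imputation is sensitive to outliers, and income columns typically have a long right tail. A small sketch on toy values (not the campaign data) showing why median imputation is a common alternative:

```python
import numpy as np
import pandas as pd

# Toy income column (hypothetical values): one extreme value
# pulls the mean far above the typical income.
income = pd.Series([30_000, 35_000, 40_000, np.nan, 600_000])

mean_filled = income.fillna(income.mean())      # mean = 176250.0
median_filled = income.fillna(income.median())  # median = 37500.0

print(mean_filled[3])
print(median_filled[3])
```

On this toy series the mean fill inserts 176250, while the median fill inserts 37500, much closer to a typical income.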
#data['Income'] = data['Income'].replace(np.nan, income_mean) # replace the NaN with the calculated mean (note: quoting income_mean would insert the literal string, not the value)
#print(data['Income'].dtype)
Observation:
# Will try fillna rather than replace.
print(data['Income'].dtype)
data['Income'] = data['Income'].fillna(income_mean)
print(data['Income'].dtype)
float64 float64
data[data['Income'].isnull()] # checking for missing values
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response |
|---|
0 rows × 26 columns
data.isnull().sum() # Making sure no missing values in the entire dataset.
Year_Birth 0 Education 0 Marital_Status 0 Income 0 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Response 0 dtype: int64
Think About It:
# Engineer "age" variable.
curr_year = 2016
#curr_year
data['age'] = curr_year - data['Year_Birth']
data['age'].sort_values()
1170 20
46 20
696 21
747 21
1850 21
...
424 75
1950 76
192 116
339 117
239 123
Name: age, Length: 2232, dtype: int64
# Rows of column "age" > 122
rowsgt122 = data[data['age'] > 122].index
print(rowsgt122)
Int64Index([239], dtype='int64')
# drop the rows with age > 122
data.drop(rowsgt122, inplace = True)
data['age'].sort_values() # Checking to make sure the identified rows got dropped.
46 20
1170 20
747 21
1850 21
2213 21
...
2084 73
424 75
1950 76
192 116
339 117
Name: age, Length: 2231, dtype: int64
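The 122-year cutoff used above is a domain-knowledge rule (the oldest verified human age). A data-driven alternative is the 1.5×IQR whisker rule that boxplots use; a sketch on hypothetical ages, not the notebook's method:

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) whisker bounds using the k*IQR rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy ages (hypothetical): 123 sits far above the upper whisker.
ages = pd.Series([20, 25, 30, 35, 40, 45, 50, 55, 60, 123])
lo, hi = iqr_bounds(ages)
outliers = ages[(ages < lo) | (ages > hi)]
print(outliers.tolist())
```

On this toy series only the value 123 falls outside the whiskers, so the rule flags the same kind of implausible age the hard cutoff did.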
# Engineer the "total_kids" variable
data['total_kids'] = data['Kidhome'] + data['Teenhome']
# How many members does each family have? Needs some feature engineering.
# Marital_Status + total_kids = family_size
# Merging categories in the Marital_Status column (restricting replace to this column, so other columns are untouched)
data['Marital_Status'] = data['Marital_Status'].replace(['Together', 'Married'], 'Relationship')
# Engineering a new variable "Rel_Status" with numeric codes so it can be summed
data['Rel_Status'] = data['Marital_Status'].replace({'Single' : 1, 'Relationship' : 2})
#Calculate Family_Size
data['Family_Size'] = data['Rel_Status'] + data['total_kids']
# Total_Amount_Spent by the customers on various products
data['Total_Amount_Spent'] = data['MntFishProducts'] + data['MntMeatProducts'] + data['MntFruits'] + data['MntSweetProducts'] + data['MntWines'] + data['MntGoldProds']
# Total_Purchases made by the customers across all channels
data['Total_Purchases'] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases']
# How long the customer has been with the company
# Converting Dt_Customer to datetime; the dates are day-first (e.g. 15-11-2013), so give an explicit format
data.Dt_Customer = pd.to_datetime(data.Dt_Customer, format = '%d-%m-%Y')
#data.Dt_Customer[1]
curr_date = pd.to_datetime('2016-01-01').normalize() # reference date, since the data is assumed to be collected in 2016
curr_date
data['Enrolled_In_Days'] = (curr_date - data.Dt_Customer) / np.timedelta64(1, 'D') #number of days enrolled
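Dividing by `np.timedelta64(1, 'D')` works; the `.dt.days` accessor on a timedelta Series is an equivalent, arguably more readable option. A sketch with hypothetical enrollment dates:

```python
import pandas as pd

# Toy enrollment dates (hypothetical); subtracting a datetime Series
# from a Timestamp yields a timedelta Series, and .dt.days extracts
# the whole-day counts directly.
enrolled = pd.to_datetime(pd.Series(["2012-09-04", "2014-03-08"]))
ref_date = pd.Timestamp("2016-01-01")
days = (ref_date - enrolled).dt.days
print(days.tolist())
```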
# How many offers the customers have accepted?
data['Total_Offers_Accepted'] = data['AcceptedCmp1'] + data['AcceptedCmp2'] + data['AcceptedCmp3'] + data['AcceptedCmp4'] + data['AcceptedCmp5'] + data['Response']
# Total Amount spent per purchase
#data['Total_Purchases'] = data['NumDealsPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases'] + data['NumWebPurchases']
data['Amount_Per_Purchase'] = round((data['Total_Amount_Spent'] / data['Total_Purchases']), 2)
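A customer with zero purchases makes this ratio divide by zero and produce `inf` (which is handled further below by dropping rows). An alternative is to guard the division up front; a sketch on toy totals, not the notebook's approach:

```python
import numpy as np
import pandas as pd

# Toy totals (hypothetical): the third customer has zero purchases,
# so a plain division would yield inf. np.where selects NaN instead.
spend = pd.Series([100.0, 250.0, 6.0])
purchases = pd.Series([4, 10, 0])

amount_per_purchase = np.where(purchases > 0, spend / purchases, np.nan)
print(amount_per_purchase)
```

The zero-purchase customer gets `NaN` rather than `inf`, which downstream tools such as `describe()` and scalers handle more gracefully.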
# Describing the data once more, now with the engineered features
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2231.0 | 1.968827e+03 | 11.885098 | 1899.00 | 1959.00 | 1970.00 | 1977.000 | 1996.0 |
| Income | 2231.0 | 5.162981e+04 | 20605.486323 | 1730.00 | 35428.50 | 51563.00 | 68118.000 | 113734.0 |
| Kidhome | 2231.0 | 4.446437e-01 | 0.538605 | 0.00 | 0.00 | 0.00 | 1.000 | 2.0 |
| Teenhome | 2231.0 | 5.069476e-01 | 0.544702 | 0.00 | 0.00 | 0.00 | 1.000 | 2.0 |
| Recency | 2231.0 | 4.912281e+01 | 28.954103 | 0.00 | 24.00 | 49.00 | 74.000 | 99.0 |
| MntWines | 2231.0 | 3.050632e+02 | 336.801738 | 0.00 | 24.00 | 176.00 | 505.000 | 1493.0 |
| MntFruits | 2231.0 | 2.639220e+01 | 39.826540 | 0.00 | 2.00 | 8.00 | 33.000 | 199.0 |
| MntMeatProducts | 2231.0 | 1.653913e+02 | 219.424360 | 0.00 | 16.00 | 67.00 | 231.500 | 1725.0 |
| MntFishProducts | 2231.0 | 3.765845e+01 | 54.697914 | 0.00 | 3.00 | 12.00 | 50.000 | 259.0 |
| MntSweetProducts | 2231.0 | 2.716764e+01 | 41.330676 | 0.00 | 1.00 | 8.00 | 34.000 | 263.0 |
| MntGoldProds | 2231.0 | 4.418512e+01 | 52.208570 | 0.00 | 9.00 | 25.00 | 56.500 | 362.0 |
| NumDealsPurchases | 2231.0 | 2.318691e+00 | 1.894814 | 0.00 | 1.00 | 2.00 | 3.000 | 15.0 |
| NumWebPurchases | 2231.0 | 4.099059e+00 | 2.774573 | 0.00 | 2.00 | 4.00 | 6.000 | 27.0 |
| NumCatalogPurchases | 2231.0 | 2.637382e+00 | 2.796029 | 0.00 | 0.00 | 2.00 | 4.000 | 28.0 |
| NumStorePurchases | 2231.0 | 5.809951e+00 | 3.241921 | 0.00 | 3.00 | 5.00 | 8.000 | 13.0 |
| NumWebVisitsMonth | 2231.0 | 5.332138e+00 | 2.415652 | 0.00 | 3.00 | 6.00 | 7.000 | 20.0 |
| AcceptedCmp3 | 2231.0 | 7.306141e-02 | 0.260296 | 0.00 | 0.00 | 0.00 | 0.000 | 1.0 |
| AcceptedCmp4 | 2231.0 | 7.485433e-02 | 0.263215 | 0.00 | 0.00 | 0.00 | 0.000 | 1.0 |
| AcceptedCmp5 | 2231.0 | 7.306141e-02 | 0.260296 | 0.00 | 0.00 | 0.00 | 0.000 | 1.0 |
| AcceptedCmp1 | 2231.0 | 6.454505e-02 | 0.245776 | 0.00 | 0.00 | 0.00 | 0.000 | 1.0 |
| AcceptedCmp2 | 2231.0 | 1.299866e-02 | 0.113294 | 0.00 | 0.00 | 0.00 | 0.000 | 1.0 |
| Complain | 2231.0 | 9.412819e-03 | 0.096584 | 0.00 | 0.00 | 0.00 | 0.000 | 1.0 |
| Response | 2231.0 | 1.497087e-01 | 0.356866 | 0.00 | 0.00 | 0.00 | 0.000 | 1.0 |
| age | 2231.0 | 4.717346e+01 | 11.885098 | 20.00 | 39.00 | 46.00 | 57.000 | 117.0 |
| total_kids | 2231.0 | 9.515912e-01 | 0.751910 | 0.00 | 0.00 | 1.00 | 1.000 | 3.0 |
| Rel_Status | 2231.0 | 1.644106e+00 | 0.478891 | 1.00 | 1.00 | 2.00 | 2.000 | 2.0 |
| Family_Size | 2231.0 | 2.595697e+00 | 0.907201 | 1.00 | 2.00 | 3.00 | 3.000 | 5.0 |
| Total_Amount_Spent | 2231.0 | 6.058579e+02 | 601.452000 | 5.00 | 69.00 | 397.00 | 1044.500 | 2525.0 |
| Total_Purchases | 2231.0 | 1.486508e+01 | 7.621082 | 0.00 | 8.00 | 15.00 | 21.000 | 43.0 |
| Enrolled_In_Days | 2231.0 | 9.030309e+02 | 232.194880 | 391.00 | 732.00 | 904.00 | 1077.000 | 1454.0 |
| Total_Offers_Accepted | 2231.0 | 4.482295e-01 | 0.889846 | 0.00 | 0.00 | 0.00 | 1.000 | 5.0 |
| Amount_Per_Purchase | 2231.0 | inf | NaN | 0.53 | 9.71 | 23.38 | 45.455 | inf |
col2_dtype = data['Amount_Per_Purchase'].dtype
print(f"The datatype of 'Amount_Per_Purchase' is {col2_dtype}")
inf_count = np.isinf(data['Amount_Per_Purchase']).sum()
print(f"There are {inf_count} infinity values in 'Amount_Per_Purchase'")
The datatype of 'Amount_Per_Purchase' is float64 There are 2 infinity values in 'Amount_Per_Purchase'
# find column(s) with infinity values
inf_cols = data.isin([np.inf, -np.inf]).any()
# get column name(s) with infinity values
inf_col_names = inf_cols[inf_cols == True].index.tolist()
print(f"The following column(s) contain infinity values: {inf_col_names}")
The following column(s) contain infinity values: ['Amount_Per_Purchase']
# get rows with infinity values
inf_rows = data.isin([np.inf, -np.inf]).any(axis=1)
inf_df = data[inf_rows]
print(f"The following rows contain infinity values: \n{inf_df}")
The following rows contain infinity values:
Year_Birth Education Marital_Status Income Kidhome Teenhome \
981 1965 Graduation Single 4861.0 0 0
1524 1973 Graduation Single 3502.0 1 0
Dt_Customer Recency MntWines MntFruits ... Response age \
981 2014-06-22 20 2 1 ... 0 51
1524 2013-04-13 56 2 1 ... 0 43
total_kids Rel_Status Family_Size Total_Amount_Spent \
981 0 1 1 6
1524 1 1 2 5
Total_Purchases Enrolled_In_Days Total_Offers_Accepted \
981 0 558.0 0
1524 0 993.0 0
Amount_Per_Purchase
981 inf
1524 inf
[2 rows x 35 columns]
rows_index = data.index[data.isin([np.inf, -np.inf]).any(axis = 1)]
print(rows_index)
Int64Index([981, 1524], dtype='int64')
# Dropping the above rows
data.drop(rows_index, inplace = True)
#print("Dataframe after deleting rows with infinity values:")
#print(data)
# Making sure there are no more infinity values
# find column(s) with infinity values
inf_cols = data.isin([np.inf, -np.inf]).any()
# get column name(s) with infinity values
inf_col_names = inf_cols[inf_cols == True].index.tolist()
print(f"The following column(s) contain infinity values: {inf_col_names}")
The following column(s) contain infinity values: []
**No column names displayed, so the infinity values are gone.**
What are the most important observations and insights from the data based on the EDA and Data Preprocessing performed?
pre_processed_data = data.copy() # saving a copy in case the "data" df gets corrupted
# Dropping all the redundant information columns.
# Note: with inplace=True, drop() returns None, so the result must not be
# assigned -- the drop happens directly on `data`.
data.drop(["Year_Birth", "Education", "Marital_Status", "Dt_Customer",
           "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5",
           "Response", "Complain", "Family_Size", "total_kids"],
          axis = 1, inplace = True)
#processed_data = data.drop(["Income", "age", "Family_Size"], axis=1, inplace = True)
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Income | 2229.0 | 51672.380132 | 20565.602889 | 1730.00 | 35523.00 | 51569.00 | 68118.00 | 113734.0 |
| Kidhome | 2229.0 | 0.444594 | 0.538636 | 0.00 | 0.00 | 0.00 | 1.00 | 2.0 |
| Teenhome | 2229.0 | 0.507402 | 0.544735 | 0.00 | 0.00 | 0.00 | 1.00 | 2.0 |
| Recency | 2229.0 | 49.132795 | 28.960156 | 0.00 | 24.00 | 49.00 | 74.00 | 99.0 |
| MntWines | 2229.0 | 305.335128 | 336.830396 | 0.00 | 24.00 | 177.00 | 505.00 | 1493.0 |
| MntFruits | 2229.0 | 26.414984 | 39.837141 | 0.00 | 2.00 | 8.00 | 33.00 | 199.0 |
| MntMeatProducts | 2229.0 | 165.538807 | 219.467513 | 0.00 | 16.00 | 68.00 | 232.00 | 1725.0 |
| MntFishProducts | 2229.0 | 37.691790 | 54.711120 | 0.00 | 3.00 | 12.00 | 50.00 | 259.0 |
| MntSweetProducts | 2229.0 | 27.192014 | 41.341203 | 0.00 | 1.00 | 8.00 | 34.00 | 263.0 |
| MntGoldProds | 2229.0 | 44.223867 | 52.215955 | 0.00 | 9.00 | 25.00 | 57.00 | 362.0 |
| NumDealsPurchases | 2229.0 | 2.320772 | 1.894389 | 0.00 | 1.00 | 2.00 | 3.00 | 15.0 |
| NumWebPurchases | 2229.0 | 4.102737 | 2.773097 | 0.00 | 2.00 | 4.00 | 6.00 | 27.0 |
| NumCatalogPurchases | 2229.0 | 2.639749 | 2.796166 | 0.00 | 0.00 | 2.00 | 4.00 | 28.0 |
| NumStorePurchases | 2229.0 | 5.815164 | 3.238697 | 0.00 | 3.00 | 5.00 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2229.0 | 5.324361 | 2.402730 | 0.00 | 3.00 | 6.00 | 7.00 | 20.0 |
| age | 2229.0 | 47.173620 | 11.889826 | 20.00 | 39.00 | 46.00 | 57.00 | 117.0 |
| Rel_Status | 2229.0 | 1.644684 | 0.478716 | 1.00 | 1.00 | 2.00 | 2.00 | 2.0 |
| Total_Amount_Spent | 2229.0 | 606.396590 | 601.452740 | 8.00 | 69.00 | 397.00 | 1045.00 | 2525.0 |
| Total_Purchases | 2229.0 | 14.878421 | 7.611471 | 1.00 | 8.00 | 15.00 | 21.00 | 43.0 |
| Enrolled_In_Days | 2229.0 | 903.145357 | 232.176186 | 391.00 | 732.00 | 904.00 | 1077.00 | 1454.0 |
| Total_Offers_Accepted | 2229.0 | 0.448632 | 0.890144 | 0.00 | 0.00 | 0.00 | 1.00 | 5.0 |
| Amount_Per_Purchase | 2229.0 | 33.304800 | 45.063796 | 0.53 | 9.71 | 23.35 | 45.35 | 1679.0 |
data_model = data.copy() # saving a copy in case the "data" df gets corrupted
data_model.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2229 entries, 0 to 2239 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Income 2229 non-null float64 1 Kidhome 2229 non-null int64 2 Teenhome 2229 non-null int64 3 Recency 2229 non-null int64 4 MntWines 2229 non-null int64 5 MntFruits 2229 non-null int64 6 MntMeatProducts 2229 non-null int64 7 MntFishProducts 2229 non-null int64 8 MntSweetProducts 2229 non-null int64 9 MntGoldProds 2229 non-null int64 10 NumDealsPurchases 2229 non-null int64 11 NumWebPurchases 2229 non-null int64 12 NumCatalogPurchases 2229 non-null int64 13 NumStorePurchases 2229 non-null int64 14 NumWebVisitsMonth 2229 non-null int64 15 age 2229 non-null int64 16 Rel_Status 2229 non-null int64 17 Total_Amount_Spent 2229 non-null int64 18 Total_Purchases 2229 non-null int64 19 Enrolled_In_Days 2229 non-null float64 20 Total_Offers_Accepted 2229 non-null int64 21 Amount_Per_Purchase 2229 non-null float64 dtypes: float64(3), int64(19) memory usage: 400.5 KB
#Just curious to see how all these variables are correlated to each other
plt.figure(figsize = (15, 15))
sns.heatmap(data.corr(), annot = True)
plt.show()
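Beyond eyeballing the heatmap, the strongest correlation pairs can be pulled out programmatically. A sketch with a made-up helper on toy columns (not the campaign features):

```python
import numpy as np
import pandas as pd

def top_corr_pairs(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n strongest absolute pairwise correlations,
    keeping each pair once (upper triangle, no diagonal)."""
    corr = df.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs.sort_values(ascending=False).head(n)

# Toy frame (hypothetical columns): b is a noisy copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.1, size=200),
                   "c": rng.normal(size=200)})
top = top_corr_pairs(df, n=1)
print(top)
```

On the real data, `top_corr_pairs(data, n=10)` would list which engineered features (e.g. Total_Amount_Spent vs the individual Mnt columns) are most redundant.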
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns = data.columns) # data_model and data can be used
# interchangeably here, since both hold the same data
data_scaled.head()
| Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | age | Rel_Status | Total_Amount_Spent | Total_Purchases | Enrolled_In_Days | Total_Offers_Accepted | Amount_Per_Purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.314461 | -0.825592 | -0.931676 | 0.306255 | 0.978946 | 1.546266 | 1.733954 | 2.455412 | 1.471211 | 0.838555 | ... | 2.632856 | -0.560587 | 0.697546 | 0.994887 | -1.346994 | 1.680648 | 1.330078 | 1.976764 | 0.619554 | 0.696396 |
| 1 | -0.259150 | 1.031365 | 0.904492 | -0.384504 | -0.874034 | -0.638115 | -0.727099 | -0.652514 | -0.633699 | -0.732199 | ... | -0.586559 | -1.178258 | -0.135027 | 1.247260 | -1.346994 | -0.963545 | -1.166714 | -1.667838 | -0.504112 | -0.639344 |
| 2 | 0.969828 | -0.825592 | -0.931676 | -0.798959 | 0.358317 | 0.567061 | -0.175641 | 1.340215 | -0.149812 | -0.042599 | ... | -0.228846 | 1.292426 | -0.551314 | 0.321892 | 0.742394 | 0.282053 | 0.804437 | -0.172948 | -0.504112 | 0.080908 |
| 3 | -1.217178 | 1.031365 | -0.931676 | -0.798959 | -0.874034 | -0.562792 | -0.663294 | -0.506259 | -0.585311 | -0.751354 | ... | -0.944272 | -0.560587 | 0.281260 | -1.276472 | 0.742394 | -0.920306 | -0.903894 | -1.926320 | -0.504112 | -0.592289 |
| 4 | 0.321999 | 1.031365 | -0.931676 | 1.549621 | -0.392972 | 0.416414 | -0.216658 | 0.151890 | -0.004646 | -0.559799 | ... | 0.128866 | 0.057084 | -0.135027 | -1.024098 | 0.742394 | -0.306654 | 0.541617 | -0.823462 | -0.504112 | -0.246257 |
5 rows × 22 columns
tsne = TSNE(n_components = 2, random_state = 1) # Apply the t-SNE algorithm with random_state = 1
data_tsne = tsne.fit_transform(data_scaled) # Fit and transform t-SNE on the scaled data
data_tsne.shape
(2229, 2)
data_tsne = pd.DataFrame(data = data_tsne, columns = ['Component 1', 'Component 2'])
data_tsne.head()
| Component 1 | Component 2 | |
|---|---|---|
| 0 | -17.536951 | -43.880280 |
| 1 | 35.045784 | -29.042521 |
| 2 | -34.295082 | -10.124798 |
| 3 | 44.926903 | 48.946674 |
| 4 | 20.414270 | 42.588531 |
# Plot a scatter plot to see the groupings
sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1])
plt.show()
Observation and Insights:
sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1], hue = data.Income)
plt.show()
Trying to visualize data with different perplexity values
for i in range(10, 50, 5): # starting with 10 and going up to 45 in increments of 5
    tsne = TSNE(n_components = 2, random_state = 1, perplexity = i) # Applying t-SNE for each perplexity value
    data_tsne = tsne.fit_transform(data_scaled) # Fit and transform t-SNE on the scaled data
    data_tsne = pd.DataFrame(data_tsne) # Converting the embeddings to a dataframe
    data_tsne.columns = ['X1', 'X2']
    plt.figure(figsize = (10,10))
    sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
    plt.title("perplexity = {}".format(i))
    plt.show()
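Eyeballing scatter plots is one way to compare perplexity settings; the fitted estimator also exposes the final optimization objective via its `kl_divergence_` attribute. A sketch on random toy data (not the campaign data): lower values indicate a tighter fit, though values at different perplexities are not strictly comparable because the target distribution itself changes.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data: 60 random points in 5 dimensions (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))

for perp in (5, 15):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=1)
    tsne.fit_transform(X)
    # kl_divergence_ is the Kullback-Leibler objective after optimization.
    print(perp, round(tsne.kl_divergence_, 3))
```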
Even though I see seven groupings, I am hoping they will form three or four reasonably sized clusters!!! Let's see what PCA tells us!
Think about it:
# Figuring out the number of principal components to generate
n = data_scaled.shape[1]
print('Number of Principal Components: ', n)
Number of Principal Components: 22
# Principal Components for the data
# Applying the PCA algorithm with random_state=1
pca = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca.fit_transform(data_scaled)) # Fit and transform the pca function on scaled data
# The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_
exp_var.cumsum()
array([0.38335424, 0.48732342, 0.55357456, 0.60720205, 0.65353071,
0.69897098, 0.73946305, 0.77362121, 0.80576543, 0.83576611,
0.86108828, 0.88568982, 0.90569495, 0.92453842, 0.94219911,
0.95873953, 0.97304848, 0.98359186, 0.9934839 , 1. ,
1. , 1. ])
# Visualizing the explained variance by individual components
plt.figure(figsize = (10, 10))
plt.plot(range(1, 23), exp_var.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variance by Components")
plt.xlabel("Number of Components")
plt.ylabel('Cumulative Explained Variance')
plt.show()
# Finding the smallest number of components that explain more than 90% of the variance
cum_var = 0
for ix, i in enumerate(exp_var): # same approach as in the elective project
    cum_var += i
    if cum_var > 0.90:
        print("Number of PCs that explain at least 90% of variance: ", ix + 1)
        break
Number of PCs that explain at least 90% of variance:  13
85% - 11 PCs
90% - 13 PCs
97% - 17 PCs
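The loop above can also be written as a vectorized one-liner: `np.argmax` on a boolean array returns the first `True` index. A sketch with hypothetical variance ratios (not the fitted `exp_var`):

```python
import numpy as np

# Hypothetical explained-variance ratios for illustration.
exp_var_toy = np.array([0.5, 0.25, 0.15, 0.06, 0.04])

# cumsum = [0.50, 0.75, 0.90, 0.96, 1.00]; the first entry strictly
# above 0.90 is at index 3, so 4 components are needed.
n_components = np.argmax(exp_var_toy.cumsum() > 0.90) + 1
print(n_components)
```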
# Visualizing the explained variance by individual components
plt.figure(figsize = (10, 10))
# Function definition to add value labels to the individual bars in the plot
def addvaluelabels(x,y):
for i in range(len(x)):
plt.text(i+1, y[i]/2, y[i], ha = 'center') #trying to align to center of the bar
plt.bar(range(1, 23), exp_var, alpha = 0.5, align = 'center', label = 'Individual Explained Variance')
plt.plot(range(1, 23), exp_var.cumsum(), marker = 'o', linestyle = '--', label='Cumulative Explained Variance')
#Add value labels
x = range(1,23)
y = np.around(exp_var, 2)
addvaluelabels(x, y)
plt.title("Explained Variance by Components")
plt.xlabel("Number of Components")
plt.ylabel('Cumulative Explained Variance')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
Observation and Insights: No visible break in the plot to eyeball the number of components. As mentioned above, will try with 13 PCs.
# Creating a new dataframe with first 13 principal components as columns and original features as indices
cols = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'PC11', 'PC12', 'PC13']
df_pc1 = pd.DataFrame(np.round(pca.components_.T[:, 0:13], 2), index = data_scaled.columns, columns = cols)
df_pc1
def color_high(val):
if val <= -0.30:
return 'background: pink'
elif val >= 0.30:
return 'background: skyblue'
df_pc1.style.applymap(color_high)
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Income | 0.300000 | 0.010000 | -0.180000 | 0.080000 | -0.020000 | 0.020000 | -0.110000 | 0.030000 | -0.070000 | -0.040000 | 0.030000 | -0.140000 | 0.060000 |
| Kidhome | -0.220000 | -0.010000 | 0.240000 | -0.050000 | -0.050000 | 0.020000 | -0.300000 | 0.480000 | 0.040000 | -0.240000 | -0.110000 | 0.080000 | 0.070000 |
| Teenhome | -0.030000 | -0.440000 | -0.390000 | 0.030000 | 0.030000 | -0.020000 | 0.080000 | 0.190000 | 0.010000 | 0.130000 | 0.170000 | -0.630000 | -0.050000 |
| Recency | 0.000000 | -0.010000 | -0.050000 | -0.290000 | 0.830000 | 0.250000 | -0.330000 | -0.090000 | 0.120000 | -0.080000 | 0.060000 | -0.040000 | -0.060000 |
| MntWines | 0.280000 | -0.120000 | 0.050000 | 0.310000 | 0.080000 | 0.110000 | -0.100000 | -0.090000 | -0.110000 | 0.010000 | 0.020000 | 0.030000 | -0.120000 |
| MntFruits | 0.230000 | 0.130000 | 0.060000 | -0.300000 | -0.100000 | -0.100000 | 0.090000 | 0.160000 | 0.050000 | -0.160000 | 0.210000 | -0.110000 | -0.720000 |
| MntMeatProducts | 0.280000 | 0.170000 | 0.080000 | 0.040000 | 0.070000 | 0.070000 | -0.010000 | 0.250000 | -0.170000 | 0.040000 | -0.080000 | 0.120000 | 0.070000 |
| MntFishProducts | 0.240000 | 0.140000 | 0.060000 | -0.300000 | -0.070000 | -0.080000 | 0.130000 | 0.170000 | 0.090000 | -0.170000 | -0.030000 | 0.010000 | -0.080000 |
| MntSweetProducts | 0.230000 | 0.110000 | 0.060000 | -0.290000 | -0.070000 | -0.080000 | 0.050000 | 0.090000 | 0.060000 | -0.280000 | 0.420000 | -0.200000 | 0.570000 |
| MntGoldProds | 0.200000 | -0.090000 | 0.120000 | -0.160000 | -0.000000 | -0.110000 | 0.140000 | -0.050000 | 0.730000 | 0.390000 | -0.320000 | -0.040000 | 0.040000 |
| NumDealsPurchases | -0.030000 | -0.510000 | 0.170000 | -0.110000 | -0.070000 | -0.030000 | -0.220000 | 0.430000 | -0.060000 | -0.000000 | -0.210000 | 0.000000 | 0.030000 |
| NumWebPurchases | 0.200000 | -0.350000 | 0.090000 | -0.020000 | -0.090000 | -0.060000 | -0.110000 | -0.240000 | 0.160000 | 0.060000 | 0.470000 | 0.330000 | 0.110000 |
| NumCatalogPurchases | 0.290000 | 0.010000 | -0.000000 | 0.030000 | 0.020000 | 0.020000 | -0.070000 | -0.010000 | 0.020000 | -0.150000 | -0.360000 | 0.000000 | 0.110000 |
| NumStorePurchases | 0.260000 | -0.150000 | -0.090000 | -0.110000 | -0.110000 | -0.010000 | -0.160000 | -0.230000 | -0.320000 | 0.030000 | -0.140000 | 0.000000 | -0.130000 |
| NumWebVisitsMonth | -0.210000 | -0.280000 | 0.370000 | 0.080000 | 0.050000 | 0.010000 | 0.100000 | -0.000000 | 0.080000 | -0.060000 | 0.320000 | 0.190000 | -0.220000 |
| age | 0.060000 | -0.210000 | -0.510000 | 0.140000 | 0.180000 | -0.010000 | 0.430000 | 0.210000 | 0.150000 | -0.390000 | -0.070000 | 0.440000 | 0.000000 |
| Rel_Status | -0.010000 | -0.030000 | -0.050000 | -0.130000 | -0.340000 | 0.920000 | 0.060000 | -0.030000 | 0.140000 | -0.050000 | 0.000000 | -0.010000 | -0.010000 |
| Total_Amount_Spent | 0.330000 | 0.010000 | 0.080000 | 0.110000 | 0.050000 | 0.060000 | -0.020000 | 0.070000 | -0.040000 | 0.010000 | -0.010000 | 0.040000 | -0.060000 |
| Total_Purchases | 0.280000 | -0.320000 | 0.040000 | -0.070000 | -0.090000 | -0.030000 | -0.190000 | -0.090000 | -0.090000 | -0.020000 | -0.080000 | 0.120000 | 0.030000 |
| Enrolled_In_Days | 0.040000 | -0.220000 | 0.470000 | -0.080000 | 0.230000 | 0.090000 | 0.600000 | -0.150000 | -0.290000 | -0.120000 | -0.190000 | -0.210000 | 0.100000 |
| Total_Offers_Accepted | 0.140000 | 0.060000 | 0.230000 | 0.610000 | 0.020000 | 0.030000 | -0.120000 | -0.020000 | 0.340000 | -0.370000 | 0.030000 | -0.320000 | -0.060000 |
| Amount_Per_Purchase | 0.210000 | 0.130000 | 0.080000 | 0.210000 | 0.180000 | 0.160000 | 0.170000 | 0.460000 | -0.110000 | 0.540000 | 0.250000 | 0.100000 | 0.050000 |
Think About It:
# Copy of the PCA components to store the labels from each algorithm.
# Note: DataFrame.copy() is deep by default, so the copy owns its own data;
# copy(deep = False) would only copy references to the underlying data.
data_pca_copy = data_pca.copy(deep = True)
# Step 1
# Dictionary to store the SSE (Sum of Squared Errors) for each K; SSE is also called distortion or inertia
sse = {}
# Step 2
# Iterate for a range of Ks and fit the pca components to the algorithm
for k in range(1, 10):
kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
sse[k] = kmeans.inertia_ # Store the inertia value for K
# Step 3
# Elbow plot
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-') # 'bx-' ==> blue x
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.show()
Observation:
# Silhouette Score
# Dictionary to store Silhouette Score for each value of K
sil_score = {}
# Iterate for a range of Ks and fit the PCs to the algorithm.
for k in range(2, 10):
kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
labels = kmeans.predict(data_pca)
sil_score[k] = silhouette_score(data_pca, labels)
# Silhouette score plot
plt.figure()
plt.plot(list(sil_score.keys()), list(sil_score.values()), 'rx-')
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
#plt.text(list(sil_score.keys()), sil_score, list(sil_score.values()), ha='center')
plt.show()
#K-Means Silhouette Score @k=4, read from the loop above
# (note: `labels` at this point still holds the last iteration's k=9 predictions,
# so recomputing silhouette_score(data_pca, labels) here would score k=9, not k=4)
kmeans_sc = sil_score[4]
print("K-Means Silhouette Score @k=4: ", round(kmeans_sc, 4))
K-Means Silhouette Score @k=4: 0.1076
K-Means Cluster Profiling with k=4
kmeans = KMeans(n_clusters = 4, random_state = 1) # Applying the K-Means algorithm
kmeans.fit(data_pca) # Fit the K-Means function on the PCA components
# Adding the predicted labels to the copied data and the original data
data_pca_copy['Labels_k4'] = kmeans.labels_ # Save the predictions on the pca components from K-Means
data_model['Labels_k4'] = kmeans.labels_
data_pca['Labels_k4'] = kmeans.labels_
# Number of observations in each cluster
data_model['Labels_k4'].value_counts()
2 580 1 561 3 547 0 541 Name: Labels_k4, dtype: int64
Wow!! Somehow I stumbled into such an even distribution!!
# Calculating summary statistics of the original data for each label
mean = data_model.groupby('Labels_k4').mean()
median = data_model.groupby('Labels_k4').median()
df_kmeans = pd.concat([mean, median], axis = 0)
df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean',
'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']
df_kmeans.T
| group_0 Mean | group_1 Mean | group_2 Mean | group_3 Mean | group_0 Median | group_1 Median | group_2 Median | group_3 Median | |
|---|---|---|---|---|---|---|---|---|
| Income | 76434.511188 | 29304.335456 | 58988.666322 | 42364.727797 | 76982.00 | 29478.00 | 59422.000 | 42557.0 |
| Kidhome | 0.035120 | 0.855615 | 0.217241 | 0.669104 | 0.00 | 1.00 | 0.000 | 1.0 |
| Teenhome | 0.147874 | 0.016043 | 0.884483 | 0.967093 | 0.00 | 0.00 | 1.000 | 1.0 |
| Recency | 50.164510 | 48.171123 | 47.667241 | 50.652651 | 53.00 | 47.00 | 48.000 | 51.0 |
| MntWines | 631.449168 | 31.197861 | 485.991379 | 72.396709 | 575.00 | 12.00 | 423.000 | 44.0 |
| MntFruits | 68.469501 | 6.058824 | 27.401724 | 4.652651 | 54.00 | 3.00 | 16.000 | 2.0 |
| MntMeatProducts | 468.961183 | 24.953654 | 149.025862 | 27.137112 | 430.00 | 15.00 | 131.000 | 17.0 |
| MntFishProducts | 100.959335 | 8.976827 | 35.770690 | 6.605119 | 93.00 | 4.00 | 20.000 | 3.0 |
| MntSweetProducts | 71.393715 | 6.094474 | 27.437931 | 4.851920 | 58.00 | 3.00 | 15.000 | 2.0 |
| MntGoldProds | 77.704251 | 17.040998 | 66.412069 | 15.462523 | 57.00 | 11.00 | 47.000 | 9.0 |
| NumDealsPurchases | 1.181146 | 1.868093 | 3.720690 | 2.427788 | 1.00 | 1.00 | 3.000 | 2.0 |
| NumWebPurchases | 5.036969 | 2.190731 | 6.712069 | 2.372943 | 5.00 | 2.00 | 7.000 | 2.0 |
| NumCatalogPurchases | 6.048059 | 0.491979 | 3.289655 | 0.782450 | 6.00 | 0.00 | 3.000 | 1.0 |
| NumStorePurchases | 8.316081 | 3.074866 | 8.150000 | 3.676417 | 8.00 | 3.00 | 8.000 | 3.0 |
| NumWebVisitsMonth | 2.746765 | 6.985740 | 5.668966 | 5.804388 | 2.00 | 7.00 | 6.000 | 6.0 |
| age | 47.578558 | 37.787879 | 50.827586 | 52.524680 | 47.00 | 38.00 | 51.000 | 52.0 |
| Rel_Status | 1.611830 | 1.643494 | 1.660345 | 1.661792 | 2.00 | 2.00 | 2.000 | 2.0 |
| Total_Amount_Spent | 1418.937153 | 94.322638 | 792.039655 | 131.106033 | 1383.00 | 60.00 | 732.000 | 90.0 |
| Total_Purchases | 20.582255 | 7.625668 | 21.872414 | 9.259598 | 20.00 | 7.00 | 21.000 | 9.0 |
| Enrolled_In_Days | 904.826248 | 896.778966 | 960.756897 | 846.925046 | 908.00 | 901.00 | 977.500 | 828.0 |
| Total_Offers_Accepted | 1.055453 | 0.221034 | 0.400000 | 0.133455 | 1.00 | 0.00 | 0.000 | 0.0 |
| Amount_Per_Purchase | 74.297079 | 11.142139 | 36.335310 | 12.278720 | 67.59 | 8.62 | 33.225 | 11.0 |
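A compact way to read tables like the one above is to express each cluster's mean as a z-score against the overall mean, so the features that define each cluster stand out at a glance. A sketch on toy data (the helper and the toy frame are illustrative, not part of this notebook):

```python
import pandas as pd

def profile_clusters(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Cluster means expressed as z-scores relative to the overall
    mean and standard deviation of each feature."""
    means = df.groupby(label_col).mean()
    overall = df.drop(columns=label_col)
    return (means - overall.mean()) / overall.std()

# Toy frame (hypothetical): cluster 1 has clearly higher income.
toy = pd.DataFrame({
    "Income": [30, 32, 31, 90, 95, 92],
    "label":  [0,  0,  0,  1,  1,  1],
})
prof = profile_clusters(toy, "label")
print(prof)
```

On the real data, `profile_clusters(data_model, 'Labels_k4')` would show, for each cluster, which features sit above (positive) or below (negative) the population average.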
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']
for col in cols_visualise:
sns.boxplot(x = 'Labels_k4', y = col, data = data_model)
plt.show()
#Visualize Clusters using PCA
#cols_visualise = ['Income','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
# 'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
# 'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']
#for col in cols_visualise:
#sns.scatterplot(x = col, y = 'Income', data = data_model, hue = 'Labels_k4', palette = 'Dark2')
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'Labels_k4', palette = 'Dark2')
plt.show()
Think About It:
Group_0 (labels follow the summary table above) seems to be the most affluent of the four groups. Its mean income is above 75K, and it spends the most on wines, meat, and fish products. This group shops mostly in stores, followed by catalog and web purchases, and has accepted more offers than any other group. It has the fewest kids at home, and although its number of web visits is the lowest, it still makes a reasonable number of web purchases, after store and catalog.
Group_2 comes next from an income perspective, with a mean income of about 60K. It has more teens than young kids at home. This group also spends heavily on wine and meat products, with gold products third in line, and its purchases are concentrated in the store and web channels.
Group_3 has both young kids and teenagers at home. Within its means, it spends on wines, meat, and gold products, and less on fruit and sweet products. It visits the web often, but its shopping is spread almost equally between store, web, and deal purchases. In comparison, this group accepted the fewest offers.
Group_1 is the youngest and least affluent group, typically with one young kid at home and almost no teens. It spends on gold products along with wine and meat products, rather than fish. This group visits the web more often than any other, but its spending is very prudent, split across store, web, and deal purchases.
**Let's see if other clustering algorithms yield any more insights than K-Means.**
#data_pca = data_pca1.copy()
#data_pca1 = data_pca.drop(["Labels_k4","kmedoLabels_k4"], axis=1)
#data_model.info()
kmedo = KMedoids(n_clusters = 4, random_state = 1) # Apply the K-Medoids algorithm with n_clusters = 4 and random_state = 1
kmedo.fit(data_pca) # Fit the model on the pca components
# (note: data_pca still carries the 'Labels_k4' column added earlier; ideally it would be dropped before refitting)
data_pca_copy['kmedoLabels_k4'] = kmedo.predict(data_pca)
data_model['kmedoLabels_k4'] = kmedo.predict(data_pca)
#data_pca['kmedoLabels_k4'] = kmedo.predict(data_pca)
#K-Medoids Silhouette Score
labels = kmedo.predict(data_pca)
kmedo_sc = silhouette_score(data_pca, labels)
print("K-Medoids Silhouette Score: ", round(kmedo_sc, 4))
K-Medoids Silhouette Score: 0.1071
# Number of observations in each cluster
data_model.kmedoLabels_k4.value_counts()
2    772
3    584
1    485
0    388
Name: kmedoLabels_k4, dtype: int64
# Calculating summary statistics of the original data for each label
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']
mean = data_model.groupby('kmedoLabels_k4').mean()
median = data_model.groupby('kmedoLabels_k4').median()
df_kmedoids = pd.concat([mean, median], axis = 0)
df_kmedoids.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean',
'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']
df_kmedoids[cols_visualise].T
| group_0 Mean | group_1 Mean | group_2 Mean | group_3 Mean | group_0 Median | group_1 Median | group_2 Median | group_3 Median | |
|---|---|---|---|---|---|---|---|---|
| Income | 29432.234315 | 34812.944547 | 55362.002727 | 75572.404436 | 30130.5 | 34738.0 | 55760.0 | 76063.0 |
| MntWines | 39.376289 | 24.688660 | 374.356218 | 623.864726 | 18.0 | 15.0 | 298.5 | 567.5 |
| MntFruits | 7.018041 | 3.070103 | 18.718912 | 68.863014 | 4.0 | 2.0 | 10.0 | 54.5 |
| MntMeatProducts | 28.636598 | 15.047423 | 112.845855 | 451.130137 | 17.0 | 10.0 | 87.5 | 415.0 |
| MntFishProducts | 10.422680 | 4.369072 | 25.380829 | 99.756849 | 6.0 | 2.0 | 13.0 | 90.0 |
| MntSweetProducts | 6.739691 | 3.259794 | 18.424870 | 72.244863 | 4.0 | 2.0 | 9.5 | 58.5 |
| MntGoldProds | 20.685567 | 9.525773 | 51.987047 | 78.416096 | 12.5 | 6.0 | 34.0 | 57.0 |
| NumDealsPurchases | 2.319588 | 1.647423 | 3.494819 | 1.328767 | 2.0 | 1.0 | 3.0 | 1.0 |
| NumWebPurchases | 2.417526 | 1.635052 | 5.549223 | 5.359589 | 2.0 | 1.0 | 5.0 | 5.0 |
| NumCatalogPurchases | 0.579897 | 0.331959 | 2.645078 | 5.917808 | 0.0 | 0.0 | 2.0 | 6.0 |
| NumStorePurchases | 3.121134 | 3.008247 | 7.010363 | 8.356164 | 3.0 | 3.0 | 7.0 | 8.0 |
| NumWebVisitsMonth | 7.546392 | 5.925773 | 5.634715 | 2.938356 | 8.0 | 6.0 | 6.0 | 2.0 |
| Total_Amount_Spent | 112.878866 | 59.960825 | 601.713731 | 1394.275685 | 73.0 | 46.0 | 530.0 | 1368.5 |
| Total_Purchases | 8.438144 | 6.622680 | 18.699482 | 20.962329 | 8.0 | 6.0 | 19.0 | 21.0 |
| Total_Offers_Accepted | 0.273196 | 0.107216 | 0.347150 | 0.982877 | 0.0 | 0.0 | 0.0 | 0.0 |
Group labels have shifted between algorithms! As with K-Means, I was expecting group_3 to be the most affluent and group_0 the least. The PCA visualization below confirms that the clusters themselves did not change; only the labels have shifted.
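One way to make labels comparable across algorithms is to re-number each solution's clusters by a shared statistic such as mean income. The sketch below uses a tiny synthetic frame; `remap_by_mean` is a hypothetical helper, not part of this notebook, and in practice it would be applied to `data_model` with the actual label columns.

```python
import pandas as pd

def remap_by_mean(df, label_col, value_col):
    # Re-number clusters so label 0 has the lowest mean of value_col,
    # the highest label the highest mean
    order = df.groupby(label_col)[value_col].mean().sort_values().index
    mapping = {old: new for new, old in enumerate(order)}
    return df[label_col].map(mapping)

# Tiny synthetic frame: both algorithms found the same partition with swapped IDs
df = pd.DataFrame({
    "Income":   [20, 22, 80, 82],
    "kmeans":   [0, 0, 1, 1],
    "kmedoids": [1, 1, 0, 0],
})
df["kmeans_r"]   = remap_by_mean(df, "kmeans", "Income")
df["kmedoids_r"] = remap_by_mean(df, "kmedoids", "Income")
print(df[["kmeans_r", "kmedoids_r"]])  # both columns now agree: 0, 0, 1, 1
```

After remapping, "group_3" always means "highest-income group", regardless of which algorithm produced the labels.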
#Visualize Clusters using PCA
#cols_visualise = ['Total_Amount_Spent', 'Total_Offers_Accepted', 'Total_Purchases', 'Enrolled_In_Days', 'Recency', 'Amount_Per_Purchase']
#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'kmedoLabels_k4', palette = 'Dark2')
plt.show()
for col in cols_visualise:
sns.boxplot(x = 'kmedoLabels_k4', y = col, data = data_model)
plt.show()
Observations and Insights:
Summary for each cluster:
group_3 appears to be the most affluent of the four groups. Its mean income is about 75K, and it spends the most on wine and meat products; the next tier is fish, gold, sweet, and fruit products. This group shops mostly in stores, followed by catalog and web purchases, and it has accepted more offers than any other group. Its web visits are fewer than the other groups', but after store purchases it spends roughly equally through the web and catalog channels.
From an income perspective, this is the next group after group_3, with a mean income of around 55K. It also spends heavily on wine and meat products, with gold products third in line. Most of its purchases are made in stores and on the web.
This group's mean income is around 34K. Within its means, it spends on wine, meat, and gold products, and less on fruit and sweet products. It is second among the groups in web visits, and its shopping is spread almost equally between web and deal purchases, with store purchases slightly higher than both. By comparison, it has accepted very few offers.
This is the least affluent group, with a mean income of around 29K. Along with wine and meat products, it also spends on gold products. It visits the web more often than any other group, but its spending is very prudent, in the order of store, web, and deal purchases. Its usage of the catalog channel is very low.
**K-Medoids shows the same customer behavior observed earlier with K-Means, though the group-level values are slightly smaller than their K-Means counterparts. All the customer trends, despite the smaller numbers, are the same between K-Means and K-Medoids.**
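Rather than eyeballing the shifted labels, the agreement between the two partitions can be quantified with the Adjusted Rand Index, which is invariant to label permutation. A minimal sketch with synthetic stand-in labels (in the notebook, the real inputs would be the K-Means and K-Medoids label columns of `data_model`):

```python
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-ins; in the notebook these would be the label columns
# data_model['Labels_k4'] and data_model['kmedoLabels_k4']
kmeans_labels  = [0, 0, 1, 1, 2, 2, 3, 3]
kmedoid_labels = [3, 3, 2, 2, 1, 1, 0, 0]  # same partition, permuted IDs

ari = adjusted_rand_score(kmeans_labels, kmedoid_labels)
print(f"Adjusted Rand Index: {ari:.2f}")  # 1.00 means identical partitions
```

An ARI near 1 would confirm that the two algorithms found essentially the same customer segments despite the different label numbering.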
# List of all linkage methods to check
linkage_methods = ['single', 'average', 'complete']
# list of distance metrics
distance_metrics = ["euclidean", "cityblock", "correlation", "cosine"]
high_cophenet_corr = 0 # Track the highest cophenetic correlation seen so far
high_dm_lm = [0, 0] # Its corresponding [distance metric, linkage method]
for dm in distance_metrics:
for lm in linkage_methods:
# Z - linkage matrix
Z = linkage(data_pca, metric=dm, method=lm) # Applying different linkages with different distance metrics on data_pca
#c - cophenet correlation
[c, cophenet_dist] = cophenet(Z, pdist(data_pca)) # Z - output of linkage function.
# cophenet_dist - height of the link
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c: # Checking if this cophenetic correlation beats the previous best
high_cophenet_corr = c # Updating the best score
high_dm_lm[0] = dm # Storing its distance metric
high_dm_lm[1] = lm # Storing its linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.7316482357048057.
Cophenetic correlation for Euclidean distance and average linkage is 0.8415725173893369.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7709704860518053.
Cophenetic correlation for Cityblock distance and single linkage is 0.7547761040359486.
Cophenetic correlation for Cityblock distance and average linkage is 0.8387365549064711.
Cophenetic correlation for Cityblock distance and complete linkage is 0.8084480191971283.
Cophenetic correlation for Correlation distance and single linkage is 0.5003386528997378.
Cophenetic correlation for Correlation distance and average linkage is 0.594236294654903.
Cophenetic correlation for Correlation distance and complete linkage is 0.5325071285023306.
Cophenetic correlation for Cosine distance and single linkage is 0.5043462132260496.
Cophenetic correlation for Cosine distance and average linkage is 0.5885397187935416.
Cophenetic correlation for Cosine distance and complete linkage is 0.554279256538745.
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8415725173893369, which is obtained with Euclidean distance and average linkage.
Dendrograms for different linkages with Euclidean distance
# Linkage methods list
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
comp_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size (15, 30)
# Loop thru all the above list of linkage methods
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(data_pca, metric="euclidean", method=method) # distances between two clusters
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # dendrogram title
[cophenet_corr, cophenet_dist] = cophenet(Z, pdist(data_pca)) # Calculating cophenetic correlation for each linkage with Euclidean distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{cophenet_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Now with Cityblock distance
# Linkage methods list
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
comp_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(20, 30)) # Setting the plot size (20, 30)
# Loop thru all the above list of linkage methods
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(data_pca, metric="cityblock", method=method) # Note: scipy metric names are lowercase ("Cityblock" raises an error)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # dendrogram title
[cophenet_corr, cophenet_dist] = cophenet(Z, pdist(data_pca)) # Calculating cophenetic correlation for different linkages with city block distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{cophenet_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Think about it:
# Clustering with 4 clusters
hierarchical = AgglomerativeClustering(n_clusters = 4, affinity = 'cityblock', linkage = 'average') # affinity name must be lowercase
hierarchical.fit(data_pca)
AgglomerativeClustering(affinity='cityblock', linkage='average', n_clusters=4)
data_pca_copy['HCLabels_k4'] = hierarchical.labels_
data_model['HCLabels_k4'] = hierarchical.labels_
# Hierarchical Clustering Silhouette Score
labels = hierarchical.fit_predict(data_pca)
hc_sc = silhouette_score(data_pca, labels)
print("Hierarchical Clustering Silhouette Score: ", round(hc_sc, 4))
Hierarchical Clustering Silhouette Score: 0.4721
data_model.HCLabels_k4.value_counts()
0    2223
1       4
2       1
3       1
Name: HCLabels_k4, dtype: int64
# Checking for HCLabels == 1
#data[data.HCLabels == 1]
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']
mean = data_model.groupby('HCLabels_k4').mean()
median = data_model.groupby('HCLabels_k4').median()
df_hierarchical = pd.concat([mean, median], axis = 0)
df_hierarchical.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean',
'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']
df_hierarchical[cols_visualise].T
| group_0 Mean | group_1 Mean | group_2 Mean | group_3 Mean | group_0 Median | group_1 Median | group_2 Median | group_3 Median | |
|---|---|---|---|---|---|---|---|---|
| Income | 51707.923994 | 44234.909534 | 2447.0 | 51633.638134 | 51569.0 | 29388.819067 | 2447.0 | 51633.638134 |
| MntWines | 306.095816 | 27.000000 | 1.0 | 32.000000 | 179.0 | 11.000000 | 1.0 | 32.000000 |
| MntFruits | 26.479982 | 2.750000 | 1.0 | 2.000000 | 8.0 | 3.000000 | 1.0 | 2.000000 |
| MntMeatProducts | 164.463788 | 12.750000 | 1725.0 | 1607.000000 | 68.0 | 7.500000 | 1725.0 | 1607.000000 |
| MntFishProducts | 37.782726 | 2.750000 | 1.0 | 12.000000 | 12.0 | 2.500000 | 1.0 | 12.000000 |
| MntSweetProducts | 27.024291 | 132.750000 | 1.0 | 4.000000 | 8.0 | 133.000000 | 1.0 | 4.000000 |
| MntGoldProds | 43.893387 | 244.250000 | 1.0 | 22.000000 | 25.0 | 306.000000 | 1.0 | 22.000000 |
| NumDealsPurchases | 2.320288 | 0.000000 | 15.0 | 0.000000 | 2.0 | 0.000000 | 15.0 | 0.000000 |
| NumWebPurchases | 4.067926 | 25.500000 | 0.0 | 0.000000 | 4.0 | 26.000000 | 0.0 | 0.000000 |
| NumCatalogPurchases | 2.633828 | 0.250000 | 28.0 | 0.000000 | 2.0 | 0.000000 | 28.0 | 0.000000 |
| NumStorePurchases | 5.829960 | 0.250000 | 0.0 | 1.000000 | 5.0 | 0.000000 | 0.0 | 1.000000 |
| NumWebVisitsMonth | 5.336932 | 0.750000 | 1.0 | 0.000000 | 6.0 | 1.000000 | 1.0 | 0.000000 |
| Total_Amount_Spent | 605.739991 | 422.250000 | 1730.0 | 1679.000000 | 397.0 | 387.500000 | 1730.0 | 1679.000000 |
| Total_Purchases | 14.852002 | 26.000000 | 43.0 | 1.000000 | 15.0 | 26.000000 | 43.0 | 1.000000 |
| Total_Offers_Accepted | 0.449393 | 0.000000 | 0.0 | 1.000000 | 0.0 | 0.000000 | 0.0 | 1.000000 |
#Visualize Clusters using PCA
#cols_visualise = ['Total_Amount_Spent', 'Total_Offers_Accepted', 'Total_Purchases', 'Enrolled_In_Days', 'Recency', 'Amount_Per_Purchase']
#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'HCLabels_k4', palette = 'Dark2')
plt.show()
Hmmm! The scatter plot shows it is essentially one big cluster.
for col in cols_visualise:
sns.boxplot(x = 'HCLabels_k4', y = col, data = data_model)
plt.show()
Observations and Insights: No meaningful insights can be deduced, since the algorithm effectively produced a single cluster.
Summary of each cluster:
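The degenerate result above (one cluster of 2,223 plus near-singletons) is typical of single/average linkage chaining on outlier-heavy data. As a hedged alternative, Ward linkage (Euclidean distance only) minimizes within-cluster variance and tends to produce balanced clusters; the sketch below demonstrates this on synthetic blobs standing in for data_pca:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Four well-separated synthetic blobs stand in for data_pca
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=1)

# Ward linkage minimises within-cluster variance (only valid with Euclidean distance)
ward = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = ward.fit_predict(X)

sizes = np.bincount(labels)
print("Cluster sizes:", sizes)  # roughly balanced, unlike 2223/4/1/1
```

On the real data_pca, Ward may or may not beat K-Means, but it is far less prone to the one-giant-cluster outcome seen here.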
DBSCAN is a powerful algorithm for finding high-density clusters, but the difficulty lies in choosing its hyperparameters: eps and min_samples.
Since it is an unsupervised algorithm, there is no validation set to tune against, unlike in supervised learning. The practical approach is to try a grid of hyperparameter combinations and compute the silhouette score for each.
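A common heuristic for narrowing the eps search is the k-distance curve: sort every point's distance to its k-th nearest neighbour (k is usually set to the candidate min_samples) and look for the "knee". A minimal sketch on synthetic data standing in for data_pca:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))  # synthetic stand-in for data_pca

k = 8  # usually set to the candidate min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)     # ascending distances, self included at column 0
k_dist = np.sort(dists[:, -1])  # distance to the k-th neighbour, sorted

# In the notebook one would plt.plot(k_dist) and read eps off the knee
print(f"k-distance range: {k_dist[0]:.2f} .. {k_dist[-1]:.2f}")
```

Points left of the knee sit in dense regions; eps chosen near the knee separates dense clusters from sparse noise, shrinking the grid that has to be searched below.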
# Range for eps and min-samples
eps_range = np.arange(1, 2)
min_samples_range = range(2, 12)
# Load your data here or create a random dataset for testing
#X = np.random.randn(100, 2)
# Create an empty array to store the silhouette scores
silhouette_scores = np.zeros((len(eps_range), len(min_samples_range)))
# Loop over all combinations of eps and min-samples
for i, eps in enumerate(eps_range):
for j, min_samples in enumerate(min_samples_range):
# Create a DBSCAN object with the current hyperparameters
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
# Fit the DBSCAN model to the data
dbscan.fit(data_pca)
# Calculate the silhouette score for the current clustering
if len(set(dbscan.labels_)) > 1: # check if there's more than one cluster
silhouette_scores[i, j] = silhouette_score(data_pca, dbscan.labels_)
else:
silhouette_scores[i, j] = -1 # set the score to -1 if there's only one cluster
# Print the silhouette scores for each combination of hyperparameters
print("Silhouette scores:")
for i, eps in enumerate(eps_range):
for j, min_samples in enumerate(min_samples_range):
print(f"eps={eps}, min_samples={min_samples}: {silhouette_scores[i, j]:.3f}")
Silhouette scores:
eps=1, min_samples=2: -0.254
eps=1, min_samples=3: -0.355
eps=1, min_samples=4: -0.227
eps=1, min_samples=5: -0.234
eps=1, min_samples=6: -0.217
eps=1, min_samples=7: -0.206
eps=1, min_samples=8: -0.210
eps=1, min_samples=9: -0.114
eps=1, min_samples=10: -0.114
eps=1, min_samples=11: -0.145
# Find the highest silhouette score and its hyperparameters
best_score = np.max(silhouette_scores)
best_eps_idx, best_min_samples_idx = np.unravel_index(np.argmax(silhouette_scores), silhouette_scores.shape)
best_eps, best_min_samples = eps_range[best_eps_idx], min_samples_range[best_min_samples_idx]
# Print the best hyperparameters and the highest silhouette score
print(f"The best hyperparameters are eps={best_eps} and min_samples={best_min_samples}, with a silhouette score of {best_score:.3f}.")
The best hyperparameters are eps=1 and min_samples=10, with a silhouette score of -0.114.
# Create a DBSCAN object with eps=1 from the search above (note: min_samples=8 is used here, although the search favored 10)
eps = 1
min_samples = 8
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
# Fit the DBSCAN model to the data once and reuse the labels
labels = dbscan.fit_predict(data_pca)
data_pca_copy['DBSLabels_k4'] = labels
data_model['DBSLabels_k4'] = labels
# DBSCAN Clustering Silhouette Score
dbscan_sc = silhouette_score(data_pca, labels)
print("DBSCAN Clustering Silhouette Score: ", round(dbscan_sc, 4))
DBSCAN Clustering Silhouette Score: -0.2097
data_model['DBSLabels_k4'].value_counts()
-1    2142
 0      58
 1      13
 2       8
 3       8
Name: DBSLabels_k4, dtype: int64
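Note that DBSCAN's label -1 marks noise, not a cluster, so feeding it into silhouette_score (as above) drags the score down. A hedged sketch that scores only the clustered points, using synthetic blobs in place of data_pca:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight synthetic blobs stand in for data_pca
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8]],
                  cluster_std=0.5, random_state=1)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1  # keep only points assigned to a real cluster
if len(set(labels[mask])) > 1:
    sc = silhouette_score(X[mask], labels[mask])
    print(f"Silhouette (noise excluded): {sc:.3f}")
```

On the real data this would tell us whether the 87 clustered points form tight groups, separately from the fact that 2,142 points were labelled noise.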
# Calculating the mean & median
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']
mean = data_model.groupby('DBSLabels_k4').mean()
median = data_model.groupby('DBSLabels_k4').median()
df_hierachical = pd.concat([mean, median], axis = 0)
df_hierachical.index = ['group_-1 Mean', 'group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean',
'group_-1 Median', 'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']
df_hierachical[cols_visualise].T
| group_-1 Mean | group_0 Mean | group_1 Mean | group_2 Mean | group_3 Mean | group_-1 Median | group_0 Median | group_1 Median | group_2 Median | group_3 Median | |
|---|---|---|---|---|---|---|---|---|---|---|
| Income | 52520.696861 | 29596.304106 | 29956.538462 | 27722.375 | 43825.375 | 52372.5 | 28691.0 | 30992.0 | 27663.5 | 45494.0 |
| MntWines | 317.171802 | 10.655172 | 10.692308 | 12.875 | 43.750 | 198.0 | 7.0 | 9.0 | 11.0 | 41.5 |
| MntFruits | 27.371615 | 3.258621 | 2.923077 | 2.250 | 0.500 | 9.0 | 2.5 | 2.0 | 1.0 | 0.0 |
| MntMeatProducts | 171.810458 | 11.706897 | 11.384615 | 7.750 | 9.875 | 73.0 | 11.0 | 13.0 | 7.5 | 10.5 |
| MntFishProducts | 39.057890 | 4.431034 | 4.769231 | 3.625 | 0.625 | 13.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| MntSweetProducts | 28.187208 | 2.706897 | 4.153846 | 2.500 | 0.375 | 9.0 | 1.5 | 3.0 | 2.5 | 0.0 |
| MntGoldProds | 45.756303 | 7.068966 | 6.538462 | 5.125 | 3.625 | 26.0 | 6.0 | 5.0 | 5.0 | 4.0 |
| NumDealsPurchases | 2.360878 | 1.310345 | 1.153846 | 1.125 | 2.000 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| NumWebPurchases | 4.211485 | 1.413793 | 1.384615 | 1.125 | 1.875 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| NumCatalogPurchases | 2.744631 | 0.068966 | 0.000000 | 0.000 | 0.125 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| NumStorePurchases | 5.939309 | 2.672414 | 2.769231 | 2.875 | 3.250 | 5.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| NumWebVisitsMonth | 5.252101 | 7.293103 | 7.076923 | 7.000 | 5.875 | 6.0 | 7.0 | 7.0 | 7.0 | 6.0 |
| Total_Amount_Spent | 629.355275 | 39.827586 | 40.461538 | 34.125 | 58.750 | 425.0 | 37.0 | 43.0 | 34.0 | 54.5 |
| Total_Purchases | 15.256303 | 5.465517 | 5.307692 | 5.125 | 7.250 | 16.0 | 5.0 | 5.0 | 5.0 | 7.0 |
| Total_Offers_Accepted | 0.466853 | 0.000000 | 0.000000 | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
#Visualize Clusters using PCA
#cols_visualise = ['Total_Amount_Spent', 'Total_Offers_Accepted', 'Total_Purchases', 'Enrolled_In_Days', 'Recency', 'Amount_Per_Purchase']
#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'DBSLabels_k4', palette = 'Dark2')
plt.show()
Again, there are no meaningful clusters to analyze.
for col in cols_visualise:
sns.boxplot(x = 'DBSLabels_k4', y = col, data = data_model)
plt.show()
Observations and Insights:
Think about it:
Yes, changing eps and min_samples will yield different results. Since this is trial and error, it may take many iterations to reach a clear demarcation of clusters; given time constraints, I did not iterate further.
Summary of each cluster:
# GMM initialized with clusters = 4 and random_state = 1
gmm = GaussianMixture(n_components = 4, random_state = 1)
gmm.fit(data_pca) # Fitting the model
labels = gmm.predict(data_pca)
data_pca_copy['GmmLabels_k4'] = gmm.predict(data_pca)
data_model['GmmLabels_k4'] = gmm.predict(data_pca)
silhouette_scores = silhouette_score(data_pca, labels)
print("GMM Silhouette Score: ", round(silhouette_scores, 3))
GMM Silhouette Score: 0.181
data_model.GmmLabels_k4.value_counts()
0    580
3    561
1    547
2    541
Name: GmmLabels_k4, dtype: int64
cols_visualise = ['Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts',
'MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases',
'NumWebVisitsMonth','Total_Amount_Spent', 'Total_Purchases','Total_Offers_Accepted']
mean = data_model.groupby('GmmLabels_k4').mean()
median = data_model.groupby('GmmLabels_k4').median()
df_gmm = pd.concat([mean, median], axis = 0)
df_gmm.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean','group_3 Mean',
'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']
df_gmm[cols_visualise].T
| group_0 Mean | group_1 Mean | group_2 Mean | group_3 Mean | group_0 Median | group_1 Median | group_2 Median | group_3 Median | |
|---|---|---|---|---|---|---|---|---|
| Income | 58988.666322 | 42364.727797 | 76434.511188 | 29304.335456 | 59422.0 | 42557.0 | 76982.0 | 29478.0 |
| MntWines | 485.991379 | 72.396709 | 631.449168 | 31.197861 | 423.0 | 44.0 | 575.0 | 12.0 |
| MntFruits | 27.401724 | 4.652651 | 68.469501 | 6.058824 | 16.0 | 2.0 | 54.0 | 3.0 |
| MntMeatProducts | 149.025862 | 27.137112 | 468.961183 | 24.953654 | 131.0 | 17.0 | 430.0 | 15.0 |
| MntFishProducts | 35.770690 | 6.605119 | 100.959335 | 8.976827 | 20.0 | 3.0 | 93.0 | 4.0 |
| MntSweetProducts | 27.437931 | 4.851920 | 71.393715 | 6.094474 | 15.0 | 2.0 | 58.0 | 3.0 |
| MntGoldProds | 66.412069 | 15.462523 | 77.704251 | 17.040998 | 47.0 | 9.0 | 57.0 | 11.0 |
| NumDealsPurchases | 3.720690 | 2.427788 | 1.181146 | 1.868093 | 3.0 | 2.0 | 1.0 | 1.0 |
| NumWebPurchases | 6.712069 | 2.372943 | 5.036969 | 2.190731 | 7.0 | 2.0 | 5.0 | 2.0 |
| NumCatalogPurchases | 3.289655 | 0.782450 | 6.048059 | 0.491979 | 3.0 | 1.0 | 6.0 | 0.0 |
| NumStorePurchases | 8.150000 | 3.676417 | 8.316081 | 3.074866 | 8.0 | 3.0 | 8.0 | 3.0 |
| NumWebVisitsMonth | 5.668966 | 5.804388 | 2.746765 | 6.985740 | 6.0 | 6.0 | 2.0 | 7.0 |
| Total_Amount_Spent | 792.039655 | 131.106033 | 1418.937153 | 94.322638 | 732.0 | 90.0 | 1383.0 | 60.0 |
| Total_Purchases | 21.872414 | 9.259598 | 20.582255 | 7.625668 | 21.0 | 9.0 | 20.0 | 7.0 |
| Total_Offers_Accepted | 0.400000 | 0.133455 | 1.055453 | 0.221034 | 0.0 | 0.0 | 1.0 | 0.0 |
for col in cols_visualise:
sns.boxplot(x = 'GmmLabels_k4', y = col, data = data_model)
plt.show()
Observations and Insights:
**Similar behavior was found with the K-Medoids algorithm as well.**
#for col in cols_visualise:
sns.scatterplot(x = 0, y = 1, data = data_pca_copy, hue = 'GmmLabels_k4', palette = 'Dark2')
plt.show()
Observations and Insights:
**Results from this algorithm look very much like those from the K-Means and K-Medoids algorithms.**
Summary of each cluster:
group_2 appears to be the most affluent of the four groups. Its mean income is about 76K, and it spends the most on wine and meat products; the next tier is fish, gold, sweet, and fruit products. This group shops mostly in stores, followed by catalog and web purchases, and it has accepted more offers than any other group. Its web visits are fewer than the other groups', but after store purchases it spends roughly equally through the web and catalog channels.
From an income perspective, this is the next group after group_2, with a mean income of around 59K. It also spends heavily on wine and meat products, with gold products third in line. Most of its purchases are made in stores and on the web, followed by catalog and deals.
This group's mean income is around 42K. Within its means, it spends on wine, meat, and gold products, and less on fruit and sweet products. It is second among the groups in web visits, and its shopping is spread almost equally between store and web purchases, followed by deal purchases. By comparison, it has accepted very few offers.
This is the least affluent group, with a mean income of around 29K. Its top spending is on wine, followed by meat and then gold products. It visits the web more often than any other group, but its spending is very prudent, in the order of store, deal, and then web purchases. Its usage of the catalog channel is negligible.
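Here n_components=4 was fixed to match the other algorithms, but GMM also supports principled model selection via BIC (lower is better). A minimal sketch on synthetic data with three true components standing in for data_pca:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three true components stands in for data_pca
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [8, 0], [0, 8]],
                  cluster_std=1.0, random_state=1)

bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=1).fit(X)
    bics[k] = gmm.bic(X)  # Bayesian Information Criterion for this k

best_k = min(bics, key=bics.get)  # lowest BIC wins
print("Best n_components by BIC:", best_k)
```

Running the same loop on the real data_pca would show whether four components is actually the BIC-preferred choice or merely a convenient one.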
Identify and focus on the big picture first and all of its components
These components are usually the driving force for the end goal
Summarize the most important findings and takeaways in the beginning
Steps that can be taken to improve the solution
How to make the best of the solution?
What are the steps to be followed by the stakeholders
The reason for the proposed solution design
How it would affect the problem/business?
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
Silhouette Scores:
I was expecting K-Medoids or GMM to perform better, but surprisingly, Hierarchical Clustering has the highest Silhouette Score compared to K-Means, K-Medoids, and GMM. However, Hierarchical placed 2,223 observations in one group and only 4, 1, and 1 in the remaining three; similarly, DBSCAN placed 2,142 in one group and 58, 13, 8, and 8 in the others. One way to read these results is that the data contains many outliers. From that perspective, I would not select Hierarchical or DBSCAN for this dataset.
Feature engineering and the inclusion of additional features may enrich the information in the dataset and reduce outliers/noise. With further trial and error and the right parameter tuning, the Hierarchical Agglomerative Clustering algorithm might also become usable and interpretable on this dataset.
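The comparison above can be tabulated so the silhouette score and the cluster-size balance are read together; the numbers are the ones reported in this notebook, and the 30-observation threshold is an arbitrary illustration of what "too small to act on" might mean:

```python
import pandas as pd

# Scores and smallest non-noise cluster sizes reported earlier in this notebook
results = pd.DataFrame({
    "algorithm":        ["K-Medoids", "Hierarchical", "DBSCAN", "GMM"],
    "silhouette":       [0.1071, 0.4721, -0.2097, 0.181],
    "min_cluster_size": [388, 1, 8, 541],
})
# Flag solutions whose smallest cluster is too small to be a usable segment
results["degenerate"] = results["min_cluster_size"] < 30
print(results)
```

Read this way, Hierarchical's high score is revealed as an artifact of near-singleton clusters, while K-Medoids and GMM offer usable segments despite lower scores.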
2. Refined insights:
3. Proposal for the final solution design:
What are some key recommendations to implement the solution?
What are the key actionables for stakeholders?
What is the expected benefit and/or costs?
List the benefits of the solution
Take some rational assumptions to put forward some numbers on costs/benefits for stakeholders
*Estimated Costs: Data cleaning and preprocessing: 4,000; clustering algorithm development and testing: 12,000; data visualization and interpretation: 6,000; targeted marketing campaign development: 20,000; implementation and monitoring: 25,000. Total cost: 67,000*
*Estimated Benefits: Increased customer engagement: a 10% increase in website visits and a 5% increase in average purchase amount. Increased sales: 500,000 dollars in additional revenue over the next 12 months, assuming the company has a profit margin of 10%.*
*Net Estimated Benefit: Net benefit = (Revenue - Costs) x Profit margin = (500,000 - 67,000) x 10% = 43,300. Based on these assumptions, the clustering solution would generate a net benefit of 43,300 dollars over the next 12 months, suggesting it is a worthwhile investment for the company, as the benefits outweigh the costs. Note, however, that these are hypothetical estimates.*
What are the key risks and challenges?
What further analysis needs to be done or what other associated problems need to be solved?